Checking System Rules Using System-Specific, 
Programmer-Written Compiler Extensions 


Dawson Engler, Benjamin Chelf, Andy Chou, and Seth Hallem* 
Computer Systems Laboratory 
Stanford, CA 94305, U.S.A. 


Abstract 


Systems software such as OS kernels, embedded sys- 
tems, and libraries must obey many rules for both 
correctness and performance. Common examples in- 
clude “accesses to variable A must be guarded by 
lock B,” “system calls must check user pointers for 
validity before using them,” and “message handlers 
should free their buffers as quickly as possible to al- 
low greater parallelism.” Unfortunately, adherence 
to these rules is largely unchecked. 

This paper attacks this problem by showing how 
system implementors can use meta-level compilation 
(MC) to write simple, system-specific compiler exten- 
sions that automatically check their code for rule vio- 
lations. By melding domain-specific knowledge with 
the automatic machinery of compilers, MC brings the 
benefits of language-level checking and optimizing to 
the higher, “meta” level of the systems implemented 
in these languages. This paper demonstrates the ef- 
fectiveness of the MC approach by applying it to four 
complex, real systems: Linux, OpenBSD, the Xok 
exokernel, and the FLASH machine’s embedded soft- 
ware. MC extensions found roughly 500 errors in 
these systems and led to numerous kernel patches. 
Most extensions were less than a hundred lines of 
code and written by implementors who had a limited 
understanding of the systems checked. 


1 Introduction 


Systems software must obey many rules such as “check 
user permissions before modifying kernel data struc- 
tures,” “for speed, enforce mutual exclusion with spin 
locks rather than disabling interrupts,” and “message 
handlers must free their buffer before completing.” 
Code that does not obey these rules can degrade per- 
formance or crash the system. 

*This research was supported in part by DARPA contract 
MDA904-98-C-A933 and by a Terman Fellowship. 

There are several methods to find violations of 
system rules. A rigorous way is to build an abstract 
specification of the code and then use model check- 
ers [23, 32] or theorem provers/checkers [2, 11, 25] to 
check that the specification is internally consistent. 
When applicable, formal verification finds errors that 
are difficult to detect by other means. However, spec- 
ifications are difficult and costly to construct. Fur- 
ther, specifications do not necessarily mirror the code 
they abstract and, in practice, suffer from missing fea- 
tures and over-simplifications. While recent work has 
begun attacking these problems [6, 14], it is extremely 
rare for software to be verified. 

The most common method used to detect rule vio- 
lations is testing. Testing is simpler than verification. 
It also avoids the mirroring problems of formal veri- 
fication by working with actual code rather than an 
abstraction of it. However, testing is dynamic, which 
has numerous disadvantages. First, the number of 
execution paths typically grows exponentially with 
code size. Thorough, precise testing requires writing 
many test cases to exercise these paths and drive the 
system into error states. The effort required to cre- 
ate these tests, and the time it takes to run them, 
scales with the amount of code. As a result, real sys- 
tems have many paths that are rarely or never hit by 
testing and errors that manifest themselves only af- 
ter days of continuous execution. Further, finding the 
cause of a test failure can be difficult, especially when 
the effect is a delayed system crash. Finally, testing 
requires running the tested code, which can create 
significant practical problems. For example, testing 
all device drivers in an OS requires acquiring possibly 
hundreds or thousands of devices and understanding 
how to thoroughly exercise them. 

Another common method to detect rule violations 
is manual inspection. This method has the strength 
that it can consider all semantic levels and adapt to 
ad hoc coding conventions and system rules. Unfor- 
tunately, many systems have millions of lines of code 
with deep, complex code paths. Reasoning about 
a single path can take minutes or sometimes, when 
dealing with concurrency, hours. Further, the relia- 
bility of manual inspection is erratic. 

These methods leave implementors in an unfor- 
tunate situation. Verification is impractical for most 
systems. Testing misses many cases and makes di- 
agnosis difficult. Manual inspection is unreliable and 
tedious. One possible alternative is to use static com- 
piler analysis to find rule violations. Unlike verifica- 
tion, compilers work with the code itself, removing 
the need to write and maintain a specification. Un- 
like testing, static analysis can examine all execution 
paths for errors, even in code that cannot be conve- 
niently executed. Further, a compiler analysis pass 
reduces the need to construct numerous test cases 
and scales from a single function to an entire system 
with little increase in manual effort. 

Compilers can be used to enforce systems rules 
because many rules have a straightforward mapping 
to program source. Rule violations can be found by 
checking when source operations do not make sense 
at an abstract level. For example, ordering rules such 
as “interrupts must be enabled after being disabled” 
reduce to observing the order of function calls or id- 
iomatic sequences of statements (in this case, a call 
to a disable interrupt function must be followed by a 
re-enable call). 

The main barrier to a compiler checking or opti- 
mizing at this level is that while it must have a pre- 
cise understanding of the semantics of its input code, 
it typically has no idea of the “meta” semantics of 
the software system this code constructs. Thus, it 
cannot check many properties inexpressible (or just 
not expressed) in terms of the underlying language’s 
type system. This leaves an unfortunate dichotomy. 
Implementors understand the semantics of the sys- 
tem operations they build and use but do not have 
the mechanisms to check or exploit these semantics 
automatically. Compilers have the machinery to do 
so, but their domain ignorance prevents them from 
exploiting it. 

This paper shows how to automatically check sys- 
tems rules using meta-level compilation (MC). MC 
attacks this problem by making it easy for implemen- 
tors to extend compilers with lightweight, system- 
specific checkers and optimizers. Because these ex- 
tensions can be written by system implementors them- 
selves, they can take into account the ad hoc (some- 
times bizarre) semantics of a system. Because they 
are compiler based, they also get the benefits of au- 
tomatic static analysis. 

In our MC system, implementors write extensions 
in a high-level state-machine language, metal. These 
extensions are dynamically linked into our extensible 
compiler, xg++, and applied down all flow paths in 
all functions in the program source input. They use 
language-based patterns to recognize operations that 
they care about. Then, when the input code matches 
these patterns, they detect rule violations by tran- 
sitioning between states that allow or disallow other 
operations. 

This paper’s primary contribution is its demon- 
stration that MC is a general, effective approach for 
finding system errors. Our most important results 
are: 


1. MC checkers find serious errors in complex, real 
systems code. We present a series of exten- 
sions that found roughly 500 errors in four sys- 
tems: the Linux 2.3.99 kernel, OpenBSD, the 
Xok exokernel [16], and the FLASH machine’s 
embedded cache controller code [20]. Many er- 
rors were the worst type of systems bugs: those 
that crash the system, but only after it has been 
running continuously for days. 


2. MC optimizers discover system-level opportu- 
nities that are difficult to find with manual in- 
spection. While the main focus of this paper 
is error checking, MC extensions can also be 
used for optimization. Section 8 describes three 
FLASH-specific, MC optimizers that found hun- 
dreds of system-level optimization opportuni- 
ties. 


3. MC extensions are simple. The extensions men- 
tioned above are typically less than a hundred 
lines of code. 


A practical result of our experience with MC is that 
the majority of our extensions were written by pro- 
grammers who had only a passing familiarity with the 
systems that they checked. Although writing code 
that obeys system rules can be quite difficult, these 
rules are easy to express. Thus, writing checkers for 
many of them is relatively straightforward. 

This paper is laid out as follows. Section 2 dis- 
cusses related work. Section 3 gives an overview of 
MC and the system we use to implement it. Sec- 
tion 4 applies the approach to the C assert macro 
and shows that even in such a limited domain, MC 
provides non-trivial benefits. Section 5 shows how 
to use MC to enforce ordering constraints such as 
checking that kernels verify user pointers before using 
them. Section 6 extends this to global, system-wide 
constraints. Section 7 is a more detailed case study in 
how we used MC to check Linux locking and interrupt 
disabling/re-enabling disciplines. Section 8 describes 
our FLASH optimizers, and Section 9 concludes. 


2 Related Work 


We proposed the initial idea of MC in [9] and pro- 
vided a simple system, magik (based on the lcc ANSI 
C compiler [12]), for using it. While the original pa- 
per had many examples, it provided no experimental 
evaluation. This paper provides a more developed 
view of MC, a significantly easier-to-use and more 
powerful framework for building extensions, and an 
experimental demonstration of its effectiveness. Con- 
currently with this paper, we presented a detailed 
case study of applying MC to the FLASH system [4]. 
The 8 compiler extensions presented in that paper 
discovered 34 errors in FLASH code that could po- 
tentially crash the machine, such as message handlers 
that lost or double freed hardware message buffers 
and buffer race conditions. This paper’s main differ- 
ence is its demonstration that MC is a general tech- 
nique by applying it to a variety of systems. Because 
of this broader scope, it lacks the detail in [4], but 
finds roughly a factor of ten more errors. 

Below, we compare our work to efforts in high- 
level compilation, verification, and extensible com- 
pilers. 

Higher-level compilation. Many projects have 
hard wired application-level information in compil- 
ers. These projects include: compiler-directed man- 
agement of I/O [24]; the ERASER dynamic race de- 
tection checker [30]; ParaSoft’s Insure++ [19], which 
can check for Unix system call errors; the use of static 
analysis to check for security errors in privileged pro- 
grams [1]; and the GNU compilers’ -Wall option, 
which warns about dangerous functions and question- 
able programming practices. Related to the checkers 
in this paper, Microsoft has an internal tool for find- 
ing a fixed set of coding violations in Windows device 
drivers [27] such as errors in handling 64-bit code and 
missing user pointer validity checks. 

These projects use compiler support to analyze 
specific problems, whereas MC explicitly argues for 
the general use of compilers to check and optimize 
systems and provides an extensible framework for do- 
ing so. This extensibility enables detection of rule vi- 
olations that are impossible to find without system- 
specific knowledge. 

Systems for finding software errors. Most 
approaches to statically finding software errors cen- 
ter around either formal verification (as discussed in 
Section 1) or strong type checking. 

Verification uses stronger analysis than MC ex- 
tensions. However, MC extensions appear to be more 
generally effective. To the best of our knowledge, ver- 
ification papers tend to find a small number of errors 
(typically 0-2), whereas the MC checkers in this paper 
found hundreds. Verification’s lower bug counts seem 
largely due to the difficulty in writing specifications, 
which scales with code size. As a consequence, only 
small pieces of code are verified. In contrast, because 
MC operates directly on source code, it (like tradi- 
tional compiler analyses) applies as easily to millions 
of lines of code as it does to only a few. 

Two recent strong-typing systems are the extended 
static type checking (ESC) project [8] and Intrinsa’s 
PREfix [15]. Both of these systems use stronger anal- 
yses than our approach. However, they only check for 
a fixed set of low-level errors (e.g., buffer overruns and 
null pointer references). Their lack of extensibility 
means that, with the exception of ESC’s support for 
finding some class of race conditions, neither system 
can find the system-level errors that MC can detect. 

LCLint [10] statically checks programmer source 
annotations to detect coding errors and abstraction 
barrier violations. Like ESC and Intrinsa, LCLint 
is not extensible, which prevents it from finding the 
errors that MC can find. Further, the source an- 
notations that LCLint requires scale with code size, 
significantly increasing the manual effort needed to 
apply it. 

Extensible compilation. There have been a 
number of “open compiler” systems that allow pro- 
grammers to add analysis routines, usually modeled 
as extensions, that traverse the compiler’s abstract 
syntax tree. These include Lord’s ctool [22], which 
allows scheme extensions to walk over an abstract 
syntax tree for C, and Crew’s Prolog-based AST- 
LOG [7], also used for C. 

Lamping et al. [21] and Kiczales et al. [17] argue 
for pushing domain-specific information into compi- 
lation. They use meta-object protocols (MOPs) to 
allow programs to be augmented with a “meta” part 
that controls the base [17]. Such protocols are typi- 
cally dynamic and have fairly limited analysis abili- 
ties. Shigeru Chiba’s Open C++ [3] provides a static 
MOP that allows users to extend the compilation pro- 
cess. 

The extensions in these systems are mainly lim- 
ited to syntax-based tree traversal or transformation 
and do not have data flow information. As a re- 
sult, they seem to be both less powerful than MC 
extensions and more difficult to use. Our current, 
language-based approach is a dramatic improvement 
over our previous tree-based systems: extensions are 
2-4 times smaller, have fewer bugs, and handle more 
cases. Further, to the best of our knowledge, ctool, 
ASTLOG, and Open C++ provide no experimental re- 
sults, making it difficult to evaluate their effective- 
ness. 

At a lower level, the ATOM object code modifi- 
cation system [31] gives users the ability to modify 
object code in a clean, simple manner. By focusing 
on machine code, ATOM can be used in more situa- 
tions than MC, which requires source code. However, 
while dynamic testing schemes [13, 30] are well served 
by object-level modifications, it would be difficult to 
perform our static checks without the semantic infor- 
mation available in the compiler. 

Concurrently with our original work [9], Kiczales 
et al. [18] proposed “aspect oriented programming” 
(AOP) as a way of combining code that manages 
“aspects,” such as synchronization, with code that 
needs them. AOP has the advantage of being inte- 
grated within a traditional language framework. It 
has the disadvantage that aspects have more limited 
scope than MC extensions, which survey the entire 
system as well as check rules difficult to enforce with 
an AOP framework (e.g., preventing kernel code from 
using floating point). Further, because AOP requires 
source modifications, retro-fitting it on the systems 
we check would be non-trivial. 


3 Meta-level Compilation 


Many systems constraints describe legal orderings of 
operations or specific contexts in which these oper- 
ations can or cannot occur. Since the actions rele- 
vant to these rules are visible in program source, an 
MC compiler extension can check them by searching 
for the corresponding operations and verifying that 
they obey the given ordering and/or contextual re- 
strictions. Table 1 gives a representative set of rule 
“templates” that can be checked in this manner along 
with examples. Many system rules that roughly fol- 
low these templates can be checked automatically. 
For example, an MC extension to enforce the con- 
textual rule, “for speed, if a shared variable is not 
modified, protect it with read locks,” can search for 
each write-lock critical section, examine all variable 
uses, and, if no stores occur to protected variables, 
demote the locks or suggest alternative usage. 
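
As an illustration of the read-lock rule above, the following C sketch scans a linear sequence of operations and counts the write-lock critical sections that contain no store. It is a minimal model under stated assumptions, not part of the MC system: the operation kinds and the function count_demotable are hypothetical, and every store inside a critical section is treated as a store to a protected variable.

```c
#include <stddef.h>

/* Hypothetical operation kinds standing in for matched source constructs. */
enum op { OP_WRITE_LOCK, OP_UNLOCK, OP_LOAD, OP_STORE };

/* Count write-lock critical sections that contain no store.
 * Such sections are candidates for demotion to read locks. */
int count_demotable(const enum op *ops, size_t n)
{
    int in_section = 0;   /* currently inside a write-lock section? */
    int saw_store = 0;    /* has a store occurred in this section? */
    int demotable = 0;

    for (size_t i = 0; i < n; i++) {
        switch (ops[i]) {
        case OP_WRITE_LOCK:
            in_section = 1;
            saw_store = 0;
            break;
        case OP_STORE:
            if (in_section)
                saw_store = 1;
            break;
        case OP_UNLOCK:
            if (in_section && !saw_store)
                demotable++;          /* section only read shared data */
            in_section = 0;
            break;
        case OP_LOAD:
            break;                    /* loads never block demotion */
        }
    }
    return demotable;
}
```

A real extension would match these operations with patterns over the source and would need to know which variables each lock protects.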


3.1 Language Overview 


In our implementation of MC, compiler extensions 
are written in a high-level, state-machine language, 
metal [5]. These extensions are dynamically linked 
into our extensible compiler, xg++ (based on the GNU 
g++ compiler). After xg++ translates each input func- 
tion into its internal representation, the extensions 
are applied down every possible execution path in 
that function. The state machine part of the lan- 
guage can be viewed as syntactically similar to a 
“yacc” specification. Typically, SMs use patterns 


{ #include "linux-includes.h" } 
sm check_interrupts { 
    // Variables 
    // used in patterns 
    decl { unsigned } flags; 

    // Patterns 
    // to specify enable/disable functions. 
    pat enable = { sti(); } 
               | { restore_flags(flags); } ; 
    pat disable = { cli(); }; 

    // States 
    // The first state is the initial state. 
    is_enabled: disable ==> is_disabled 
        | enable ==> { err("double enable"); } 
        ; 

    is_disabled: enable ==> is_enabled 
        | disable ==> { err("double disable"); } 
        // Special pattern that matches when the SM 
        // hits the end of any path in this state. 
        | $end_of_path$ ==> 
            { err("exiting w/intr disabled!"); } 
        ; 
} 
Figure 1: A metal SM to detect (1) when interrupts 
disabled using cli are not re-enabled using either sti 
or restore_flags and (2) duplicate enable/disable 
calls. 


to search for interesting source code features, which, 
when matched, cause transitions between states. Pat- 
terns are written in an extended version of the base 
language (C++), and can match almost arbitrary 
language constructs such as declarations, expressions, 
and statements. Expressing patterns in the base lan- 
guage makes them both flexible and easy to use, since 
they closely mirror the source constructs they de- 
scribe. 

Figure 1 shows a stripped-down metal extension 
for Linux that checks that disabled interrupts are re- 
enabled or restored to their initial state upon exit- 
ing a function. Interrupts are disabled by calling the 
cli() procedure; they are enabled by calling sti() 
or restored using restore_flags(flags), where the 
flags variable holds the interrupt state before the 
cli() was issued. Conceptually, the extension finds 
violations by checking that each call to disable in- 
terrupts has a matching enable call on all outgoing 
paths. As refinements, the extension warns of du- 
plicate calls to these functions or non-sequitur calls 
(e.g., re-enabling without disabling). A more com- 
plete version of this checker, described in Section 7, 
found 82 errors in Linux code. 

The extension tracks the interrupt status using 
Rule template, followed by examples 

“Never/always do X” 
    “Do not use floating point in the kernel.” (§ 4.3) 
    “Do not allocate large variables on the 6K byte kernel stack.” (§ 4.3) 
    “Do not send more than two messages per virtual network lane.” 
    “Allocate as much storage as an object needs.” (§ 5.2) 

“Do X rather than Y” 
    “Use memory mapped I/O rather than copying.” 
    “Avoid globally disabling interrupts.” 

“Always do X before/after Y” 
    “Check user pointers before using them in the kernel.” (§ 5.1) 
    “Handle operations that can fail (e.g., memory, disk block, virtual 
    interrupt allocation).” (§ 5.2) 
    “Re-enable interrupts after disabling them.” (§ 7) 
    “Release locks after acquiring them.” (§ 7) 
    “Check user permissions before modifying kernel data structures.” 

“Never do X before/after Y” 
    “Do not acquire lock A before B.” 
    “Do not use memory that has been freed.” (§ 5.2) 
    “Do not (deallocate an object, acquire/release a lock) twice.” (§ 5.2 § 7) 
    “Do not increment a module’s reference count after calling a function 
    that can sleep.” (§ 6.3) 

“In situation X, do (not do) Y” 
    “Protect all variable mutations with write locks.” 
    “If a system call fails, reverse all side-effect operations (deallocate 
    memory, disk blocks, pages, unincrement reference counters).” (§ 5.2 § 6.3) 
    “To avoid deadlock, while interrupts are disabled, do not call functions 
    that can sleep.” (§ 6.2) 

“In situation X, do Y rather than Z” 
    “If a variable is not modified, protect it with read locks.” 
    “If code does not share data with interrupt handlers, then use spin 
    locks rather than the more expensive interrupt disabling.” 
    “To save an instruction when setting a message opcode, xor in the new 
    and old opcode rather than using assignment.” (§ 8) 





Table 1: Sample system rule templates and examples. Checkers for the rule are denoted by section number. 


/* From Linux 2.3.99 drivers/block/raid5.c */ 
static struct buffer_head * 
get_free_buffer(struct stripe_head *sh, 
                int b_size) { 
    struct buffer_head *bh; 
    unsigned long flags; 

    save_flags(flags); 
    cli(); 
    if ((bh = sh->buffer_pool) == NULL) 
        return NULL; 
    sh->buffer_pool = bh->b_next; 
    bh->b_size = b_size; 
    restore_flags(flags); 
    return bh; 
} 
Figure 2: Example code from the Linux 2.3.99 Raid 
5 driver illustrating a real error caught by the exten- 
sion. The SM will be applied down both paths in 
this function. The path ending with a return of bh is 
well formed and will be accepted. The path ending 
with the return of NULL is not, and will get a warning 
about not re-enabling interrupts. 


two states, is_enabled and is_disabled. SMs start 
in the state mentioned in the first transition defi- 
nition (here, is_enabled). Each state has a set of 
rules specifying a pattern, an optional state transi- 
tion, and an optional action. Actions can be arbi- 
trary C++ code. For a given state, metal checks 
pattern rules in lexical order. If any code matches 
the specified patterns, metal processes this matching 
code, sets the state to the new state (the token af- 
ter the ==> operator), and executes the action. In 
this example, is_enabled has two rules. The first, 
actionless rule searches for functions that disable in- 
terrupts using the disable pattern and transitions to 
the is_disabled state. The second rule searches for 
calls to functions that enable interrupts and gives a 
warning. Since it does not specify a transition state, 
the SM remains in the is_enabled state. If no pat- 
tern matches, the SM remains in the same state and 
continues down the current code path. The flags 
variable is a wild card that matches any expression 
of type unsigned. When it is matched, metal will put 
the matching expression in flags, which can then be 
used in an action. We use this feature in an extension 
discussed in Section 4. 
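
The state semantics just described can be modeled in ordinary C. The sketch below is only an illustration of the two-state machine from Figure 1 applied to a single path; it is not how metal is implemented. The event kinds and the function check_path are hypothetical names, and each event stands for a source operation that one of the patterns has already matched.

```c
#include <stddef.h>

/* Hypothetical event kinds, one per matched source operation. */
enum event { EV_CLI, EV_STI, EV_RESTORE_FLAGS, EV_OTHER };
enum state { IS_ENABLED, IS_DISABLED };

/* Walk one execution path and return the number of rule violations. */
int check_path(const enum event *path, size_t n)
{
    enum state st = IS_ENABLED;   /* the first state is the initial state */
    int errors = 0;

    for (size_t i = 0; i < n; i++) {
        int is_enable  = (path[i] == EV_STI || path[i] == EV_RESTORE_FLAGS);
        int is_disable = (path[i] == EV_CLI);

        if (st == IS_ENABLED) {
            if (is_disable)
                st = IS_DISABLED;     /* actionless transition */
            else if (is_enable)
                errors++;             /* "double enable", state unchanged */
        } else {
            if (is_enable)
                st = IS_ENABLED;
            else if (is_disable)
                errors++;             /* "double disable", state unchanged */
        }
        /* If no pattern matched, the SM stays in the same state. */
    }
    if (st == IS_DISABLED)
        errors++;                     /* "exiting w/intr disabled!" */
    return errors;
}
```

For example, a path consisting of a cli() call followed by unrelated code yields one violation, while a path that ends with restore_flags(flags) yields none.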

To run this SM, it is first compiled with mcc, 
our metal compiler. It is then dynamically linked 
into xg++ using a compile-time, command-line flag. 
When run on the Linux “RAID 5” driver buffer allo- 
cation code in Figure 2, it is pushed down both paths 
in the function. The first path returns NULL when 
the buffer pool is empty (i.e., when the if condition 
is true); the other returns a buffer on successful alloca- 
tion. The first path fails to re-enable interrupts, and 
this error¹ is caught and reported by the extension. 
One way to get a feel for how costly it would be to 
manually perform the check our SM does automati- 
cally is that even when we showed an experienced 
Linux programmer the exact error in Figure 2, it took 
him over 20 minutes to examine a single call chain out 
of the nine leading to this function. Performing sim- 
ilar analysis for the other hundreds of thousands of 
lines of driver and kernel code seems impractical. 
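
The paper does not show how the error in Figure 2 was repaired. One plausible repair, sketched below as an assumption, restores the interrupt state on both paths by restructuring the function around a single exit. To keep the sketch self-contained, the kernel primitives are replaced by user-space stand-ins: the macros save_flags, restore_flags, and cli here, the global intr_enabled, the reduced structure definitions, and the helper interrupts_enabled are all hypothetical.

```c
#include <stddef.h>

/* User-space stand-ins for the kernel primitives (assumptions). */
static int intr_enabled = 1;
#define save_flags(f)    ((f) = (unsigned long)intr_enabled)
#define restore_flags(f) (intr_enabled = (int)(f))
#define cli()            (intr_enabled = 0)

/* Reduced versions of the driver structures, with only the fields used. */
struct buffer_head { struct buffer_head *b_next; int b_size; };
struct stripe_head { struct buffer_head *buffer_pool; };

/* Variant of get_free_buffer with a single exit point. */
struct buffer_head *get_free_buffer_fixed(struct stripe_head *sh, int b_size)
{
    struct buffer_head *bh;
    unsigned long flags;

    save_flags(flags);
    cli();
    if ((bh = sh->buffer_pool) != NULL) {
        sh->buffer_pool = bh->b_next;
        bh->b_size = b_size;
    }
    restore_flags(flags);   /* reached on both the NULL and non-NULL path */
    return bh;
}

/* Lets a caller observe the modeled interrupt state. */
int interrupts_enabled(void) { return intr_enabled; }
```

With this structure, the checker in Figure 1 would see a matching restore_flags on every path out of the function.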


3.2 Practical issues 


Metal SMs can specify whether they should be ap- 
plied either down all paths (i.e., flow-sensitive) or lin- 
early through the code (i.e., flow-insensitive). A sim- 
ple implementation of flow-sensitive SMs could take 
exponential time in some cases. We use aggressive 
caching to prune redundant code paths where SM 
instances follow paths that join (e.g., if statements, 
loops) and reach the join point in the same state. Our 
caching is based on the fact that a deterministic SM 
applied to the same input in the same internal state 
must compute the same result. The system represents 
the state of an SM as a vector holding the value of 
its variables. For each node in the input flow-graph, 
it records the set of states in which it has been vis- 
ited. If an SM arrives at a node in the same state as 
a previous instance, the system prunes it. 

While caching was originally motivated by speed, 
perhaps its most important feature is that it provides 
a clean framework for computing loop “fixed points” 
transparently. When an SM has exhausted the set of 
states reachable within the loop (typically with two 
iterations), metal automatically stops traversing the 
loop. This fixed-point behavior depends on the SM 
having a finite (and small) number of states. We do 
not currently enforce this restriction. 
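
The caching scheme and its loop behavior can be sketched in C as follows. This is a simplified model under stated assumptions rather than the actual implementation: SM state is reduced to a single two-valued enum instead of a vector of variables, the flow graph has at most MAX_NODES nodes with two successors each, duplicate enable/disable warnings are omitted, and the names struct node, struct walk, visit, and count_visits are hypothetical.

```c
#include <stddef.h>

#define MAX_NODES 64   /* assumed upper bound on flow-graph size */

enum event { EV_NONE, EV_CLI, EV_STI };
enum state { IS_ENABLED = 0, IS_DISABLED = 1 };

struct node {
    enum event ev;     /* operation performed at this node */
    int succ[2];       /* successor indices, -1 if absent */
};

struct walk {
    const struct node *g;
    unsigned seen[MAX_NODES];  /* bit s set: node already visited in state s */
    int visits;                /* (node, state) pairs actually processed */
    int errors;                /* exits reached with interrupts disabled */
};

static void visit(struct walk *w, int n, enum state st)
{
    if (w->seen[n] & (1u << st))   /* same node, same state: prune */
        return;
    w->seen[n] |= 1u << st;
    w->visits++;

    if (w->g[n].ev == EV_CLI)
        st = IS_DISABLED;
    else if (w->g[n].ev == EV_STI)
        st = IS_ENABLED;

    int is_exit = 1;
    for (int i = 0; i < 2; i++) {
        if (w->g[n].succ[i] >= 0) {
            is_exit = 0;
            visit(w, w->g[n].succ[i], st);
        }
    }
    if (is_exit && st == IS_DISABLED)
        w->errors++;
}

/* Traverse from entry; return the number of (node, state) visits. */
int count_visits(const struct node *g, int entry, int *errors)
{
    struct walk w = { g, {0}, 0, 0 };
    visit(&w, entry, IS_ENABLED);
    if (errors)
        *errors = w.errors;
    return w.visits;
}
```

Because each node is processed at most once per state, the traversal is bounded by the number of nodes times the number of states, and a loop stops being re-traversed as soon as it produces no new state at its head.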

The current xg++ system does not integrate global 
analysis with the SM framework. Instead, it pro- 
vides a library of routines to emit client-annotated 
flow graphs to a file, which can then be read and tra- 
versed. Section 6 gives an example of how we used 
this framework to compute the transitive closure of 
all possibly-sleeping functions. We are integrating 
these two passes. 


¹Amusingly, this interrupt disable bug would be masked 
by an immediate kernel segmentation fault since callers of 
this function dereference the returned pointer without checking 
whether the allocation succeeded. 


3.3 Caveats 


Most of our extensions are checkers rather than veri- 
fiers: they find bugs, but do not guarantee their ab- 
sence. For example, their ignorance of aliases pre- 
vents them from asserting that many actions “can- 
not happen.” In general, many compiler problems 
are undecidable, which places hard limits on the ef- 
fectiveness of static analysis. Despite these limita- 
tions, as our results show, MC extensions are quite 
effective. We are currently investigating how to turn 
some classes of checkers into verifiers. 

We mainly check systems we did not build. As 
a result, some rule violations we found might not be 
bugs because the code could use a non-obvious system 
feature that works correctly in a specific situation. 
We countered this danger in two ways. First, we sent 
our error logs to the system implementors of Linux, 
FLASH, and Xok for confirmation. However, while 
we got feedback on many errors, their sheer number 
meant that many did not receive careful examina- 
tion. Second, we conservatively did not count many 
cases that were difficult to reason about. While our 
results may still contain mis-diagnoses, we would be 
surprised if these caused more than a few percentage 
points difference. 

Several of our checkers produce a number of false 
positives (in the worst case, in Section 7, up to three 
per error). These are due to the limitations of both 
static analysis and our checkers, which primarily use 
simple local analyses. Usually these numbers can 
be reduced significantly by adding some amount of 
global analysis or system-specific knowledge. In al- 
most all cases, each false positive can be suppressed 
with a single source annotation. Extensions can pro- 
vide annotations by supplying a set of reserved func- 
tions that clients call to indicate that a specific source- 
level warning should be suppressed. As a refinement, 
checkers can detect bogus or erroneous annotations 
by warning when they are not needed. 

Basing our MC system on a C++ compiler has 
caused difficulties when applying it to Linux and Xok. 
These systems aggressively assume C’s more relaxed 
type system and use GNU extensions that are illegal 
in g++. Thus, while in theory MC can be applied 
to a system transparently, we had to modify Xok and 
Linux to remove GNU C constructs that are illegal in 
C++. We also modified the g++ front-end to relax its 
type checking. To avoid this labor for other systems, 
we are currently finishing a gcc-based implementa- 
tion of xg++. More generally, since the metal lan- 
guage has been designed to be shielded from both the 
underlying language and compiler, we plan to port it 
to other languages and other compilers. 

The remainder of this paper describes the exten- 
sions we implemented using metal and xg++ and the 
results of applying the concept of meta-level compi- 
lation to real systems. 


4 A Simple Meta-language 


The C assert macro takes a single condition as its ar- 
gument, checks this condition at runtime, and aborts 
execution if the condition is false. This macro defines 
one of the simplest meta-languages possible: it has 
no state and a single operation. This section shows 
how MC can help even such simple interfaces by pre- 
senting two extensions that check the following two 
assertion invariants: 


1. Assertions should not have non-debugging side- 
effects. Frequently, assert is used only for de- 
velopment and turned off in production code. If 
an assert condition has important side-effects, 
these will disappear and the program will be- 
have incorrectly. 


2. Assertion conditions should not fail. Program- 
mers use assertions to check for conditions that 
should not happen. Any code path leading to 
an assertion that causes its boolean expression 
to fail is probably a bug. 


4.1 Checking assertion side-effects 


Figure 3 presents a metal checker that inspects as- 
sertion expressions for side-effects. The directive, 
“flow_insensitive,” tells metal to apply the exten- 
sion linearly over input functions rather than down 
all paths, improving speed and error reporting (since 
there will be exactly one message per violation). The 
SM begins in the initial state, start, and uses the 
literal metal pattern “{ assert(expr); }” to find all 
assert uses.² On each match, metal stores the 
assert expression in the variable, expr. It then 
runs start’s action, which uses the metal procedure 
mgk_expr_recurse to recursively apply the SM to 
the expression in expr in the in_assert state. The 
in_assert state uses metal’s generic type “any” to 
match assignments, and pointer increments and decre- 
ments of any type. Note that the assignment pattern 
also matches C’s compound assignment operators (e.g., 
+= and -=). The extension matches any function call 
with any set of arguments using the extended types 

²Since patterns can match nearly arbitrary C code, it does 
not matter if assert is a function or a macro; we have modified 
the pre-processor to ignore line and file directives. 


{ #include <assert.h> } 

// Apply SM ignoring control flow 
sm Assert flow_insensitive { 
    // Match expressions of "any" type 
    decl { any } expr, x, y, z; 
    // Used in combination to match all 
    // calls with any arguments 
    decl { any_call } any_fcall; 
    decl { any_args } args; 

    // Find all assert calls. Then apply 
    // SM to "expr" in state "in_assert." 
    start: { assert(expr); } ==> 
        { mgk_expr_recurse(expr, in_assert); } ; 

    // Find all side-effects 
    in_assert: 
        // Match all calls 
        { any_fcall(args) } ==> 
            { err("function call"); } 
        // Match any assignment (including 
        // the operators +=, -=, etc.) 
        | { x = y } ==> { err("assignment"); } 
        // Match all increments and decrements 
        // --z and ++z omitted for brevity 
        | { z++ } ==> { err("post-increment"); } 
        | { z-- } ==> { err("post-decrement"); } ; 
} 
Figure 3: A metal SM that warns of side-effects in 
assert uses. 


any_call and any_args in combination. To assist de- 
velopers in writing extensions, metal provides a set of 
generic types for matching different classes of types 
(e.g., scalars, pointers, floats), and different program- 
ming constructs (e.g., case labels, indirections). 

When applied to Xok’s ExOS library operating 
system, this 25 line extension found 16 violations 
in 199 assert uses. Two were false positives trig- 
gered by debugging functions. These could be sup- 
pressed by wrapping such calls in a differently named, 
unchecked assertion macro. The remaining fourteen 
cases were errors in crucial system code that would 
function incorrectly if the assertion was removed. The 
underlying cause of these errors was assert’s use as 
shorthand for checking the result of possibly-failing 
operations such as insertion of page table entries and 
deallocation of shared memory regions. A typical ex- 
ample is the following snippet from the ExOS “mmap” 
code to insert a page table entry: 


/* libexos/os/mmap.c:mmap_fault_handler:410 */ 


assert (_exos_self_insert_pte(0, PG_P| 
PG_U|PG_W, PGROUNDDOWN(va), 0, NULL) == 0); 


The effect of removing the assert condition (and hence 
these calls) would be mysterious virtual memory er- 
rors. 
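
The failure mode can be reproduced in a few lines of standard C. The sketch below is illustrative only: insert_pte and the two wrapper functions are hypothetical stand-ins rather than ExOS code, and NDEBUG is defined by hand to simulate a production build.

```c
#define NDEBUG              /* simulate a production build: assert() expands to nothing */
#include <assert.h>

int pte_inserts = 0;        /* counts how often the operation really ran */

/* Hypothetical stand-in for a possibly-failing operation such as a PTE insert. */
int insert_pte(void) { pte_inserts++; return 0; }

/* Buggy: the call is part of the assert condition, so it vanishes under NDEBUG. */
void map_page_buggy(void) { assert(insert_pte() == 0); }

/* Safer: always perform the call; only the check is compiled out. */
void map_page_fixed(void) {
    int rc = insert_pte();
    assert(rc == 0);
    (void)rc;               /* avoid an unused-variable warning under NDEBUG */
}

/* Restore normal assert() behaviour for any code that follows. */
#undef NDEBUG
#include <assert.h>
```

With assertions disabled, map_page_buggy never calls insert_pte, while map_page_fixed still does. This is the pattern the checker steers code toward: keep the operation outside the assertion and assert only on its result.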
4.2 Checking assertions statically 


Assertions specify conditions that the programmer 
believes must hold. Without MC, compilers are obliv- 
ious to this fact, so assert checks can only occur dy- 
namically. With MC, it is possible to find errors by 
evaluating these conditions statically, thereby quickly 
and precisely finding errors. 

We wrote such an extension on top of xg++. At a 
high level, it uses xg++’s dataflow routines to track 
the values of scalar variables. At each assert use, it 
evaluates the assertion expression against this known 
set of values. If the expression could fail, it emits a 
warning. Currently, xg++ only performs primitive 
analysis that tracks the set of constant assignments 
to scalar variables on a given path. The set of possible 
values for a variable is then just the union of constant 
assignments to that variable before it is used. 
If any non-constant assignments occur, the value is 
considered “unknown.” Returning the set of possible 
values allows the effectiveness of the checker to 
transparently increase as our analysis in xg++ becomes 
more powerful. As a practical refinement, we 
eliminate a large class of false positives by ignoring 
assertions of the constant “0” (which always fails) 
since this is an idiomatic method for programmers to 
terminate execution in “impossible” situations. 
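
The value-set idea can be modeled in a few lines of C. This is a minimal sketch, not the extension itself: the struct, the function names, and the fixed capacity are all assumptions made for illustration, and the sketch assumes that an "unknown" variable produces no warning.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_VALUES 16       /* arbitrary capacity chosen for the sketch */

/* Possible constant values of one scalar variable on a path. */
struct value_set {
    long values[MAX_VALUES];
    size_t count;
    bool unknown;           /* set once a non-constant assignment is seen */
};

/* Record "var = constant". */
void record_constant(struct value_set *s, long c) {
    for (size_t i = 0; i < s->count; i++)
        if (s->values[i] == c) return;      /* already in the set */
    if (s->count < MAX_VALUES) s->values[s->count++] = c;
    else s->unknown = true;                 /* too many values: give up */
}

/* Record "var = <non-constant expression>". */
void record_unknown(struct value_set *s) { s->unknown = true; }

/* True if assert(cond(var)) fails for at least one tracked value.
   Unknown sets yield no warning (an assumption of this sketch). */
bool assertion_may_fail(const struct value_set *s, bool (*cond)(long)) {
    if (s->unknown) return false;
    for (size_t i = 0; i < s->count; i++)
        if (!cond(s->values[i])) return true;
    return false;
}

/* Example condition: the expression in assert(x != 0). */
bool nonzero(long x) { return x != 0; }
```

Because the checker only consumes a set of possible values, a more precise analysis could populate the same set without any change to assertion_may_fail.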

When applied to the FLASH cache coherence code 
(discussed more in Section 8) the 100 line extension 
found five errors that could have crashed the system. 
These errors underscore the value of static evalua- 
tion, since they were in code that had been heavily 
tested for over five years. They had been missed be- 
cause the length and complexity of typical FLASH 
code paths caused them to only occur sporadically. 
This complexity also makes manual detection of er- 
rors difficult. On one path, the assignment and the 
assertion that it violated were 300 lines apart and 
separated by 20 if statements, 6 else clauses, and 10 
conditional compilation directives. Another case beat 
this by having 21 if statements, 4 else clauses, and 29 
conditional compilations! Even given the exact situ- 
ation that leads to the error, inspecting such paths is 
mind-numbing. 


4.3 Discussion 


Library implementations cannot inspect the context 
in which they are used or how they are invoked. MC 
can be used to attack these blindnesses. Our first extension 
used MC to detect illegal actions in assert 
uses, something that an assert implementation can- 
not otherwise do either dynamically or statically. Our 
second extension used context knowledge to push dy- 
namically evaluated conditions to compile time. A 


4th Symposium on Operating Systems Design and Implementation 


similar approach can be used to make certain dy- 
namic error checks static or to improve performance 
by allowing implementations to specialize themselves 
to a given context, such as a memory allocator that 
generates specialized inline allocations for constant 
size allocation requests. 

The restriction on side-effects in assertion condi- 
tions is a miniature example of a more general pat- 
tern of “language subsetting,” where systems impose 
an execution context more restrictive than the base 
language in which code is written. We have built two 
other extensions that enforce system-specific execu- 
tion restrictions. The first warns when kernel code 
uses floating point. It found one case where a Linux 
graphics driver assumes that floating point calcula- 
tions will be evaluated at compile time. Using a 
compiler other than gcc or lower optimization lev- 
els could violate this assumption. The second checks 
for stack overflow. It found 10 places where Linux 
code allocated variables larger than 3K on the 6K 
kernel stack, and numerous 1K or larger allocations. 
Most of these led to patches by kernel maintainers. 
It also found a similar case in Xok where an inno- 
cent looking stack-allocated structure turned out to 
be over 8K bytes. 

In addition to checking, systems can use restric- 
tion checkers for optimization by detecting when an 
application’s actions are more limited than the gen- 
eral case. For example, a threads package can use 
smaller stack sizes than the default if it can derive an 
upper bound on stack usage. 


5 Temporal Orderings 


Many system operations must (or must not) happen 
in sequence. Sequencing rules are well-suited for com- 
piler checking since sequences are frequently encoded 
as literal procedure calls in code. This allows a metal 
extension to find violations by searching for opera- 
tions and transitioning to states that allow, disallow, 
or require other operations. This section discusses 
two such extensions. The first enforces an “X be- 
fore Y” rule that system calls properly check applica- 
tion pointers passed to them for validity before using 
them. The second checks that code obeys a set of or- 
dering rules for memory allocation and deallocation. 


5.1 Checking copyin/copyout 


Most operating systems guard against application cor- 
ruption of kernel memory by, in part, using special 
routines to check system call input pointers and to 
move data between user and kernel space. We present 
an MC extension that finds errors in such code by 
finding paths where an application pointer is used before 
passing through such routines. At each system 
call definition, the extension uses a special metal pat- 
tern to find every pointer parameter, which it binds 
to a tainted state. (The use of per-variable state 
differs from the previous checkers that used a single, 
global state per path.) The only legal operations on 
a tainted variable are being (1) killed by an assignment 
or (2) passed as an argument to functions expecting 
tainted inputs (e.g., data movement routines 
or output functions such as kprintf). All other uses 
will be signaled as an error. 
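
The per-variable rule can be written as a tiny state machine. The following C sketch is only a model of the rule as stated above; the enum and function names are hypothetical and do not come from metal or from the checker's source.

```c
#include <stdbool.h>

enum ptr_state { TAINTED, CLEAN };

enum ptr_use {
    USE_ASSIGNED,        /* variable overwritten: the taint is killed      */
    USE_PASSED_TO_SINK,  /* passed to a routine that expects user pointers */
    USE_OTHER            /* dereference, arithmetic, any other call, ...   */
};

/* Apply one use to a variable; returns true if the use is an error. */
bool check_use(enum ptr_state *state, enum ptr_use use) {
    if (*state != TAINTED) return false;   /* untracked variables are never flagged */
    switch (use) {
    case USE_ASSIGNED:       *state = CLEAN; return false;
    case USE_PASSED_TO_SINK: return false;  /* legal; the variable stays tainted */
    default:                 return true;   /* every other use is reported */
    }
}
```

Each pointer parameter of a system call would start in the TAINTED state, so a direct dereference is reported unless the variable has first been overwritten.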

We tailored a version of this checker for the Xok 
exokernel code. It detects which procedures are sys- 
tem calls using the exokernel naming convention that 
such routine names begin with the prefix “sys_.” As 
a refinement, the checker warns when any non-system- 
call routines use “paranoid” user-data routines. It 
examined 187 distinct user pointers in the exokernel 
proper and device code and found 18 errors. A typical 
error is this command to issue disk requests: 

/* from sys/kern/disk.c */ 
int sys_disk_request (u_int sn, struct Xn_name 
*xn_user, struct buf *reqbp, u_int k) { 


/* bypass for direct scsi commands */ 
if (reqbp->b_flags & B_SCSICMD) 
return sys_disk_scsicmd (sn, k, reqbp) ; 


Here, the pointer, reqbp, is passed in from user space 
and dereferenced in the if statement without being 
checked. 

This extension also signalled 15 false positives. 
Four of these were due to a stylized use where non- 
null pointers were verified using standard routines, 
but null ones were allowed through (they would be 
handled correctly by lower levels). Three others were 
due to kernel backdoors used to let system calls call 
other system calls with unchecked parameters. The 
remaining were due to the checker’s lack of global 
analysis and its disallowing of tainted variable copies. 


5.2 Checking memory management 


Most kernel code uses memory managers based loosely 
on the C procedures malloc and free. We present 
an extension that checks four common rules: 


1. Since memory allocation can fail, kernel code 
must check whether the returned pointer is valid 
(i.e., not null) before using it. 


2. Memory cannot be used after it has been freed. 


3. Paths that allocate memory and then abort with 
an error should typically deallocate this mem- 
ory before returning. 


Table 2: Error counts for Linux and OpenBSD, by 
violation type (No check, Error leak, Use after Free, 
Underflow). The checker was applied 4268 times in 
Linux and 464 times in OpenBSD. 


4. The size of allocated memory cannot be less 
than the size of the object the assigned pointer 
holds. 


Figure 4 shows a stripped-down extension that 
checks these rules. For space, the size check and most 
error reporting code is omitted. This extension, like 
the previous one, associates each variable with a state 
encoding what operations are legal on it. Pointers to 
allocated storage can be in exactly one of four states: 
unknown, null, not_null, or freed. A variable is 
bound to the unknown state at every allocation site. 
When an unknown variable is compared to null (e.g., 
in C, “0”) the extension sets the variable’s state on 
the true (null) path to null and on the false (non- 
null) path to not_null. When the variable is compared 
to non-null, these two cases are reversed. The 
two initial patterns recognize C’s check-and-compare 
allocation idiom and combine these transitions with 
the initial variable binding. Pointers passed to free 
transition to the freed state. As a minor refinement, 
when variables are overwritten, the extension stops 
following them by transitioning to the special metal 
state, stop. 

The checker only allows dereferences of not_null 
pointers. This restriction catches instances when mem- 
ory is used before being checked, on null paths, or 
after being freed. It catches double-free errors by 
warning when freed pointers are passed to free. It 
catches cases when error paths do not free allocated 
memory by warning when any not_null or unchecked 
variable reaches a return of a negative integer, which 
idiomatically signals an error path. 
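
The four states and their transitions can be mirrored in ordinary C. This is a hedged model of the rules described above rather than the metal extension: the enum names and mem_step are invented for the sketch, and it assumes that an overwrite takes precedence over every other rule.

```c
#include <stddef.h>

enum mem_state { ST_UNKNOWN, ST_NULL, ST_NOT_NULL, ST_FREED, ST_STOP };

enum mem_event {
    EV_CMP_NULL,      /* took the branch on which the pointer is null     */
    EV_CMP_NOT_NULL,  /* took the branch on which the pointer is non-null */
    EV_DEREF, EV_FREE, EV_OVERWRITE, EV_RETURN_NEG
};

/* Advance the state of one pointer; returns an error message or NULL. */
const char *mem_step(enum mem_state *s, enum mem_event e) {
    if (*s == ST_STOP) return NULL;                    /* no longer tracked */
    if (e == EV_OVERWRITE) { *s = ST_STOP; return NULL; }
    if (*s == ST_FREED)
        return e == EV_FREE ? "Dup free!" :
               e == EV_RETURN_NEG ? NULL : "Use-after-free!";
    switch (e) {
    case EV_CMP_NULL:     *s = ST_NULL;     return NULL;
    case EV_CMP_NOT_NULL: *s = ST_NOT_NULL; return NULL;
    case EV_DEREF:        /* only not_null pointers may be dereferenced */
        return *s == ST_NOT_NULL ? NULL : "Using ptr illegally!";
    case EV_FREE:         *s = ST_FREED;    return NULL;
    case EV_RETURN_NEG:   /* unchecked or non-null memory escapes on an error path */
        return *s == ST_NULL ? NULL : "Error path leak!";
    default:              return NULL;
    }
}
```

Feeding the events of one code path through mem_step in order reproduces the warnings the text describes for that path.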

The full version of the checker is 60 lines of code. 
We get a lot for so little: the extension implements a 
flow-sensitive compiler analysis pass that checks for 
rules on all paths and takes into consideration the 
observations furnished by passing through condition- 
als. As Table 2 shows, the extension found 132 errors 
in Linux and 51 errors in OpenBSD. It turned up 
61 and 3 false positives respectively, most due to not 
handling variable copies, or not detecting when allocated 
memory would be freed by a cleanup routine. 

The most common error was not checking the 
result of memory allocation: 79 cases in Linux, 49 
in OpenBSD. In Linux, the single largest source of 
these errors was an allocation macro, CODA_ALLOC, 
which was widely used throughout the Coda file sys- 
tem code. It contains the unfortunate code: 


/* include/linux/coda_linux.h:CODA_ALLOC */ 

ptr = (cast) vmalloc((unsigned long) size); 
if (ptr == 0) 
    printk("kernel malloc returns 0 at %s:%d\n", 
           __FILE__, __LINE__); 
memset( ptr, 0, size ); 


While this code prints a helpful message on every 
failed allocation, the initialization using memset will 
immediately cause a kernel segmentation fault. 
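
A user-space analogue shows the shape of a repaired version. This is a sketch under stated assumptions, not the actual Coda fix: alloc_zeroed is a hypothetical helper, and malloc stands in for vmalloc.

```c
#include <stdlib.h>
#include <string.h>

/* Allocate and zero `size` bytes. On failure, return NULL to the caller
   instead of writing through a null pointer. */
void *alloc_zeroed(size_t size) {
    void *ptr = malloc(size);
    if (ptr == NULL)
        return NULL;        /* report failure; do not fall through */
    memset(ptr, 0, size);   /* only reached with a valid pointer */
    return ptr;
}
```

The essential change is that the failure branch leaves the function, so the initialization is reachable only on the not-null path.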

The next most common error was not freeing mem- 
ory on error paths (44 in Linux, 3 in OpenBSD). A 
typical not-freeing error is given in Figure 5. An idiomatic 
mistake was to have many exit points from a 
function but forget to free the memory at all of 
these points. 
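
One common way to avoid this class of leak is to route every failure through a single cleanup label. The C sketch below illustrates the idea in user space; xalloc, xfree, attach, and the failure-injection counter are hypothetical names introduced only so the error path can be exercised.

```c
#include <stdlib.h>

int live_blocks = 0;              /* allocations not yet freed */
int allocs_before_failure = -1;   /* test hook: -1 means never fail */

/* malloc wrapper that can be told to fail after N successful calls. */
void *xalloc(size_t n) {
    if (allocs_before_failure == 0) return NULL;
    if (allocs_before_failure > 0) allocs_before_failure--;
    void *p = malloc(n);
    if (p) live_blocks++;
    return p;
}

void xfree(void *p) { if (p) { live_blocks--; free(p); } }

/* Returns 0 on success and -1 on failure; nothing leaks on failure. */
int attach(void **client_out, void **tea_out) {
    void *client = NULL, *tea = NULL;

    client = xalloc(32);
    if (!client) goto fail;
    tea = xalloc(32);
    if (!tea) goto fail;          /* client is released at the label below */

    *client_out = client;
    *tea_out = tea;
    return 0;

fail:
    xfree(tea);
    xfree(client);
    return -1;
}
```

With one exit for all failures, adding a third allocation requires one more xfree at the label instead of an audit of every return statement.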

The seven use-after-freeing errors could cause non- 
deterministic bugs if another thread re-allocated the 
freed memory. The most common case was five cut- 
and-paste uses of the code: 

/* drivers/isdn/pcbit:pcbit_init_dev */ 
kfree (dev) ; 
iounmap((unsigned char*)dev->sh_mem) ; 
release_mem_region(dev->ph_mem, 4096) ; 
Here, the memory pointed to by dev is freed and then 
immediately used in two subsequent function calls. 

Additionally, the checker discovered two under- 
allocation errors. These were particularly dangerous, 
since they could cause memory corruption whenever 
a routine is used, rather than only failing under high 
load. One was caused by an apparent typo where 
the size of the memory needed for a structure of type 
struct atm_mpoa_qos (92 bytes) was computed using 
the size of a structure of type struct atm_qos 
(84 bytes): 

/* net/atm/mpc.c:169:atm_mpoa_add_qos */ 
struct atm_mpoa_qos *entry; 

entry = kmalloc(sizeof (struct atm_qos), 
                GFP_KERNEL); 
The other error reversed kmalloc’s size and interrupt 
level arguments, specifying that 7 (the value of 
GFP_KERNEL) bytes of storage be allocated instead 
of 16. Currently, both errors are harmless, since the 
kernel uses a power-of-two memory allocator with a 
minimum allocation unit of 32 bytes. However, they 
are latent time bombs if a more space efficient allo- 
cator is ever used. 
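
The first kind of under-allocation can be avoided by deriving the size from the destination pointer instead of naming a type. The sketch below uses hypothetical struct names and malloc in place of kmalloc; alloc_size_ok simply restates the size rule as a predicate.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for a base structure and a larger one that embeds it. */
struct base_qos   { char data[84]; };
struct mpoa_entry { struct base_qos qos; char extra[8]; };

/* Rule 4 as a predicate: the requested size must cover the pointed-to object. */
int alloc_size_ok(size_t requested, size_t object_size) {
    return requested >= object_size;
}

/* sizeof *entry follows the pointer's type, so it cannot name the wrong struct. */
struct mpoa_entry *alloc_entry(void) {
    struct mpoa_entry *entry = malloc(sizeof *entry);
    return entry;
}
```

Writing sizeof *entry keeps the allocation correct even if the type of entry is later changed.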
sm null_checker { 
    decl { scalar } sz;        // match any scalar 
    decl { const int } retv;   // match const ints 
    decl { any_ptr } vi;       // match any ptr 

    // 'state' specifies 'v' will have a state 
    state decl { any_ptr } v; 

    // Associate allocated memory with unknown 
    // state until compared to null. 
    start, v.all: 
        // set v's state on true path to "null", 
        // on false path to "not_null" 
        { ((v = (any)malloc(sz)) == 0) } 
            ==> true=v.null, false=v.not_null 
        // vice versa 
      | { ((v = (any)malloc(sz)) != 0) } 
            ==> true=v.not_null, false=v.null 
        // unknown state until observed. 
      | { v = (any)malloc(sz) } ==> v.unknown; 

    // Allow comparisons on variables in 
    // states "unknown", "null", and "not_null." 
    v.unknown, v.null, v.not_null: 
        { (v == 0) } ==> 
            true = v.null, false = v.not_null 
      | { (v != 0) } ==> 
            true = v.not_null, false = v.null; 

    // Catch error path leaks by warning when 
    // a non-null, non-freed variable gets to a 
    // return of a negative integer. 
    v.unknown, v.not_null: { return retv; } ==> 
        { if(mgk_int_cst(retv) < 0) 
              err("Error path leak!"); }; 

    // No dereferences of null or unknown ptrs. 
    v.null, v.unknown: { *(any *)v } ==> 
        { err("Using ptr illegally!"); }; 

    // Allow free of all non-freed variables. 
    v.unknown, v.null, v.not_null: 
        { free(v); } ==> v.freed; 

    // Check for double free and use after free. 
    v.freed: 
        { free(v) } ==> { err("Dup free!"); } 
      | { v } ==> { err("Use-after-free!"); }; 

    // Overwriting v's value kills its state 
    v.all: { v = vi } ==> v.ok; 
} 


Figure 4: Metal extension that checks that allocated 
memory is (1) checked before use, (2) not used after 
a free, (3) not double freed, and (4) always freed on 
error paths (those returning a negative integer). 
/* from drivers/char/tea6300.c */ 
static int tea6300_attach(...) { 

    client = kmalloc(sizeof *client, GFP_KERNEL); 
    if (!client) 
        return -ENOMEM; 

    tea = kmalloc(sizeof *tea, GFP_KERNEL); 
    if (!tea) 
        return -ENOMEM; 

    MOD_INC_USE_COUNT; 

} 
Figure 5: Code with two errors: (1) not freeing mem- 
ory (client) on an error path and (2) (discussed 
in Section 6) calling MOD_INC_USE_COUNT after potentially 
blocking memory allocation calls. 


While these checks focus on raw byte memory 
management, the general extension template can be 
retrofitted to check similar rules for other, higher- 
level objects. A modified version of this extension 
found 15 probable errors in Linux “IRQ” allocation 
code where allocations were not checked for errors, 
and IRQ’s were not deallocated on error paths. 


6 Enforcing Rules Globally 


The extensions described thus far have been imple- 
mented as local analyses. However, many systems 
rules are context dependent and apply globally across 
functions in a given call chain. This section presents 
two extensions that use xg++’s global analysis framework 
to check the following Linux rules: 


1. Kernel code cannot call blocking functions with 
interrupts disabled or while holding a spin lock. 
Violating this rule can lead to deadlock [28]. 


2. A dynamically loaded kernel module cannot call 
blocking functions until the module’s reference 
count has been properly set. Violating this 
rule leads to a race condition where the module 
could be unloaded while still in use [26]. 


We first describe a global analysis pass that computes 
a transitive closure of all potentially blocking rou- 
tines. Then, we discuss how the two extensions use 
this result. 


6.1 Computing blocking routines 


We build a list of possibly blocking functions in two 
passes. The first, local pass, is a metal extension that 


Table 3 (columns: Local, Global, False Pos; rows: 
Interrupts, Spin Lock, Module): Results for checking 
if kernel routines block 
(1) with interrupts disabled (“Interrupts”), (2) while 
holding a spin lock (“Spin Lock”), or (3) in a way 
that causes a module race (“Module”). We divide errors 
into whether they needed local or global analysis. 
Local errors were due to direct calls to blocking functions; 
global errors reached a blocking routine via a 
multi-level call chain. The global analysis results for 
Module are marked as approximate since they have 
not been manually confirmed. 



traverses over every kernel routine, marking it if it 
calls functions known to potentially block. In Linux, 
blocking functions are primarily (1) kernel memory 
allocators called without the GFP_ATOMIC flag (which 
specifies not to sleep when the request cannot be ful- 
filled) or (2) routines to move data to or from user 
space (these block on a page fault). After processing 
each routine, the extension calls xg++ support 
routines to emit the routine’s flow graph to a file. 
The flow graph contains (1) the routine’s annotation 
(if any) and (2) all procedures the routine calls. 
After the entire kernel has been processed, each input 
source file will have a corresponding emitted flow 
graph file. The second, global pass, uses xg++ routines 
to link together all these files into a global call 
graph for the entire kernel. The global pass then 
uses xg++ routines to perform a depth first traversal 
over this call graph calculating which routines have 
any path to a potentially blocking function. The out- 
put of this pass is a text file containing the names of 
all functions that could ever call a blocking function. 
Running the global analysis on the Linux kernel gives 
roughly 3000 functions that could potentially sleep. 
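
The global pass amounts to a reachability computation over the call graph. The C sketch below is one way to compute it and is not the xg++ implementation: it uses a small adjacency matrix, and it walks depth-first from each locally marked function to its callers, which is a variation that yields the same set of possibly blocking functions. All names are hypothetical.

```c
#include <stdbool.h>

#define MAX_FUNCS 8   /* arbitrary size for the sketch */

struct call_graph {
    int n;                              /* number of functions           */
    bool calls[MAX_FUNCS][MAX_FUNCS];   /* calls[f][g]: f directly calls g */
    bool blocks[MAX_FUNCS];             /* in: local marks; out: closure */
};

/* Depth-first walk over callers: everything that calls `callee` may block. */
static void mark_callers(struct call_graph *g, int callee) {
    for (int f = 0; f < g->n; f++) {
        if (g->calls[f][callee] && !g->blocks[f]) {
            g->blocks[f] = true;        /* marked once, so recursion terminates */
            mark_callers(g, f);
        }
    }
}

/* Extend the local marks to every function with a path to a blocking one. */
void compute_blocking(struct call_graph *g) {
    bool seed[MAX_FUNCS];
    for (int f = 0; f < g->n; f++) seed[f] = g->blocks[f];
    for (int f = 0; f < g->n; f++)
        if (seed[f]) mark_callers(g, f);
}
```

Each function is marked at most once, so cycles in the call graph (recursion) do not cause repeated work.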


6.2 Checking for blocking deadlock 


Linux, like many OSes, uses a combination of inter- 
rupt disabling and spin locks for mutual exclusion. 
Interrupt disabling imposes an implicit rule: a thread 
running with interrupts disabled cannot block, since 
if it was the last runnable thread, the system will 
deadlock. Similarly, because of the implementation of 
Linux kernel thread scheduling, threads holding spin 
locks cannot block. Doing so causes deadlock when 
a sleeping thread holds a spin lock that a thread on 
the same CPU is trying to acquire. 

Our metal extension checks both rules by assum- 
ing each routine starts in a “clean” state with inter- 
rupts enabled and no locks held. As it traverses each 
code path, if it hits a statement that disables inter- 
rupts, it goes to a disabled state; an enable interrupt 
call returns it to the original state. Similarly, if it hits 
a function that acquires a spin lock, it traverses to a 
locked state; an unlock call returns it to the clean 
state. While in either of these states (or their compo- 
sition), the extension examines all function calls and 
reports an error if the call is to a function in the list 
of potentially blocking routines. 
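
For a single path, the check reduces to tracking two flags. The following C sketch models that walk; the operation names are hypothetical, and a real checker would classify calls by consulting the list of potentially blocking routines rather than receiving them pre-labeled.

```c
#include <stdbool.h>

enum op {
    OP_CLI, OP_STI,         /* disable / enable interrupts            */
    OP_LOCK, OP_UNLOCK,     /* acquire / release a spin lock          */
    OP_CALL_BLOCKING,       /* call to a function on the blocking list */
    OP_CALL_OTHER           /* any other call                         */
};

/* Count rule violations along one path that starts in the clean state. */
int count_violations(const enum op *path, int len) {
    bool irq_off = false, locked = false;
    int errors = 0;
    for (int i = 0; i < len; i++) {
        switch (path[i]) {
        case OP_CLI:    irq_off = true;  break;
        case OP_STI:    irq_off = false; break;
        case OP_LOCK:   locked = true;   break;
        case OP_UNLOCK: locked = false;  break;
        case OP_CALL_BLOCKING:
            if (irq_off || locked) errors++;   /* blocking in a critical section */
            break;
        case OP_CALL_OTHER:
            break;
        }
    }
    return errors;
}
```

Keeping the two flags separate covers the composed state, in which interrupts are disabled and a lock is held at the same time.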

Despite the simplicity of these rules, real code violates 
them in numerous places. The extension found 
123 errors in Linux. Of those errors, 79 could lead 
to deadlock. The remaining 44 were calls to kmalloc 
with interrupts disabled. Possibly motivated by the 
frequency of this error, the kmalloc code checks if it 
is called with interrupts disabled, and, if so, it prints 
a warning and re-enables interrupts. In situations 
where interrupt disabling was used for synchroniza- 
tion, this leads to race conditions. The following code 
snippet is representative of a typical error (the mis- 
take has been annotated in the source but not fixed): 


/* drivers/sound/midibuf.c */ 
save_flags(flags); 
cli(); 

while (c < count) 

    for (i = 0; i < n; i++) { 
        /* BROKE BROKE-CANT DO THIS WITH CLI!! */ 
        copy_from_user((char *)&tmp_data, 
                       &(buf)[c], 1); 
        QUEUE_BYTE(midi_out_buf[dev], tmp_data); 
        c++; 
    } 
restore_flags(flags); 


The call to copy_from_user can implicitly sleep, but 
is called after interrupts have been disabled with the 
call to cli. 

The local errors seem to be caused by driver im- 
plementors not having a clear picture of either (1) 
the rules they have to follow or (2) that user data 
movement routines can block. The global errors seem 
to be caused by the fact that it is often hard to tell 
if a function can potentially block without tediously 
tracing through several function calls in different files, 
or without a considerable amount of a priori Linux 
kernel knowledge. 

The checker produced eight false positives. Six 
were because the global calculation of blocking func- 
tions does not check if a called function would re- 
enable interrupts before calling a blocking function. 
Two others were caused by name conflicts where a 
file defined and called a function with the same name 
as a blocking function. 

The approach of this section also applies to other 
operating systems. Another implementor used our 
system to write an extension for the OpenBSD sys- 
tem that checked if interrupt handling code called a 
blocking operation. He found one bug where an in- 
terrupt handler could call a page allocation routine 
that in turn called a blocking memory allocator [29]. 


6.3 Checking module reference counts 


Linux allows kernel subsystems to be dynamically 
loaded and unloaded. Modules have a reference count 
tracking the number of kernel subsystems using them. 
Modules increment this count during loading (using 
MOD_INC_USE_COUNT) and decrement it during unload- 
ing (using MOD_DEC_USE_COUNT). The kernel can un- 
load modules with a zero reference count at any time. 
A module must protect against being unloaded while 
sleeping by incrementing its reference count before 
calling a blocking function. Similarly, during unload- 
ing, it cannot block after decrementing its count. Fi- 
nally, if the module aborts installation after incre- 
menting its reference count, it must decrement the 
count to restore it to its original value. 

Our extension checks for load race conditions by 
tracking if a potentially blocking function has been 
called and flagging subsequent MOD_INCs. Conversely, 
it checks for unload race conditions by tracking if a 
MOD_DEC has been performed and flagging subsequent 
calls to potentially blocking functions. It finds dan- 
gling references by emitting an error when a MOD_INC 
has not been reversed along a path that returns a neg- 
ative integer (which idiomatically signals an error). 
As Table 3 shows, a local version of the extension 
that did not use the global list of blocking functions 
found 22 rule violations, whereas the global version 
found 53 cases (we have not yet confirmed the global 
errors). 


7 Linux Mutual Exclusion 


The complexity of dealing with concurrency leads 
most of the Linux kernel and its device drivers to fol- 
low a localized strategy where critical sections begin 
and end within the same function body. Despite this 
stylized use, the size of the code and implementors’ 
imperfect understanding lead to errors. We wrote an 
extended version of the interrupt checker described in 
Section 3 to check that each kernel function conforms 
to the following conditions: 


USENIX Association 



Table 4: Results of running the Linux synchronization 
primitives checker on kernel version 2.3.99, by 
condition (Holding lock, Double lock, Double unlock, 
Intr disabled, Bottom half, Bogus flags). The 
Applied column is an estimate of the number of 
times the check was applied. We skipped twelve 
warnings that were difficult to classify. The parenthesized 
numbers show the changes when the two files 
with the most false positives are ignored. 


1. All locks acquired within the function body are 
released before exiting. 


2. No execution paths attempt to lock or unlock 
the same lock twice. 


3. Upon exiting, interrupts are either enabled or 
restored to their initial state. 


4. The “bottom halves” of interrupt handlers are 
not disabled upon exiting. 


5. Interrupt flags are saved before they are re- 
stored. 


Table 4 shows the results of running the exten- 
sion on Linux. The “Applied” column is an estimate 
of the number of times each check was applied. Two 
device drivers account for a large number of false pos- 
itives because they use macros that consult runtime 
state before locking or unlocking. The parenthesized 
numbers show the changes in the false positive results 
(over 20%) when these two files are ignored. 

The most common bugs are either holding a lock 
or leaving interrupts disabled on function exit. These 
bugs often occur when detecting an error condition 
after which the function returns immediately. For 
example, the checker found this bug in a device driver 
for PCMCIA card services: 


/* drivers/pcmcia/cs.c: 
   pcmcia_deregister_client */ 

spin_lock_irqsave(&s->lock, flags); 

client = &s->clients; 

while ((*client) && ((*client) != handle)) 
    client = &(*client)->next; 

if (*client == NULL) 
    /* forgot about &s->lock, flags! */ 
    return CS_BAD_HANDLE; 


The checks for Linux locking conventions have re- 
sulted in seven kernel patches, including a fix for the 
error shown above. All seven patches fix cases where 
a lock is mistakenly held when exiting a function, and 
six of the seven are in device drivers (the last patch 
was to an implementation of ipv4 network filters). We 
have not been able to confirm many of the other po- 
tential bugs with kernel or device driver developers, 
though several strong OS implementors have exam- 
ined them and consider them to be at least suspicious. 
Most of the potential bugs are in device drivers and 
networking code — this is not surprising since much 
of this code is written by developers throughout the 
world with varying degrees of familiarity with the 
Linux kernel. 
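
A common defence against the held-lock bug is to give the function a single exit that always releases the lock. The C sketch below models this with a plain flag in place of a spin lock; the struct, the helper names, and the numeric values of the return codes are hypothetical and are not taken from the PCMCIA driver.

```c
#define CS_SUCCESS    0   /* hypothetical values for the sketch */
#define CS_BAD_HANDLE 1

struct socket_state {
    int locked;           /* stands in for a spin lock */
    int handles[4];
    int count;
};

void slot_lock(struct socket_state *s)   { s->locked = 1; }
void slot_unlock(struct socket_state *s) { s->locked = 0; }

/* Remove `handle` from the table; the lock is released on every path. */
int deregister_client(struct socket_state *s, int handle) {
    int ret = CS_BAD_HANDLE;

    slot_lock(s);
    for (int i = 0; i < s->count; i++) {
        if (s->handles[i] == handle) {
            s->count--;
            s->handles[i] = s->handles[s->count];  /* fill the hole */
            ret = CS_SUCCESS;
            break;
        }
    }
    slot_unlock(s);       /* single exit covers success and error */
    return ret;
}
```

Because the error result is carried in ret instead of being returned immediately, the not-found case passes through the same unlock as the success case.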


The false positives mostly come from three sources. 
Code that intentionally violates the convention for 
the sake of efficiency or modularity accounts for 90 
false positives. For example, sometimes a family of 
related device drivers will define an interface that 
breaks the conventions. Another large source of false 
positives (48) is caused by the fact that our checker 
only performs local analysis. Some drivers implement 
their own locking functions using the basic primitives 
provided by the system. The checker will warn when 
these functions exit holding a lock or with interrupts 
disabled, which is exactly what they are supposed to 
do. Global analysis could eliminate many of these 
false positives. Finally, the fact that our system does 
not prune simple, impossible paths accounts for 35 
false positives. A typical example of this is when ker- 
nel code conditionally acquires a lock, performs an 
action, and then releases the lock based on the same 
condition. There are only two possible paths through 
this code, not the four that our system thinks exist. 
The remaining 21 false positives could be elim- 
inated by extending the checker’s notion of locking 
functions and changing our system to prune the false 
branch of loop conditionals of the form “for(;;).” 


8 Optimizing FLASH 


In addition to checking, MC can be used for opti- 
mization. Below, we describe three extensions writ- 
ten to find system-level optimization opportunities 
in the FLASH machine’s cache coherence code [20]. 
This code must be fast because it implements func- 
tionality (cache coherence) that is usually placed in 
hardware. Eliminating even a single instruction is 
considered beneficial. Several of the protocols ex- 
amined here have been aggressively tuned for years 
due to their use in numerous performance papers as 
evidence for the effectiveness of software-controlled 

Optimization   | Number   | False Pos 
Buffer Free    | 11       | 9 
Message Length | 40       | 
XOR Opcode     | hundreds | 

Table 5: MC-based FLASH optimizer results. Num- 
ber counts how many optimization opportunities 
were found. The XOR checker is written in an old 
version of the system — a version written in metal 
would be several factors smaller. 


cache coherence. Despite this effort, MC optimizers 
found hundreds of optimization opportunities, mostly 
due to the difficulty in manually performing equiva- 
lent searches across FLASH’s deeply nested paths. 

Buffer-free optimization. Each time a FLASH 
node receives a message, it invokes a customized mes- 
sage protocol handler that determines how to sat- 
isfy the request and update the protocol state. Han- 
dlers use the incoming message buffer to send out- 
going data messages, and must free it before exit- 
ing. Handlers can send data messages, which need 
a buffer, and control messages, which do not. Many 
handlers send more than one message when respond- 
ing to a request. To minimize the chance of losing 
a buffer, implementors are typically conservative and 
defer buffer freeing until the last handler send, ir- 
respective of whether the last send(s) was a control 
message and therefore did not need a buffer. Unfor- 
tunately, while this strategy simplifies handler code, 
it increases buffer contention under high load. 

Our extension indicates when buffer frees can oc- 
cur earlier in the code. It traces all sends on each path 
through the function, and by looking at send argu- 
ments, detects if the send (1) needs a buffer and (2) 
frees its buffer. It gives a suggestion for any path that 
has an active buffer that ends with a “suffix” of con- 
trol sends. The extension is 56 lines long, and found 
11 instances in a large FLASH protocol, “dyn_ptr,” 
where the buffer could be safely freed earlier. Each of 
these optimizations could be implemented by chang- 
ing only two lines of code. The extension also pro- 
duced nine false positives. Most of these were cases 
where the execution path was too complex to opti- 
mize without major code restructuring. 
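The path analysis described above can be pictured with a small stand-alone sketch in C. This is not metal code and not the extension itself: the `send_kind` enum and both functions are hypothetical names, and a path is modelled simply as an array of sends in execution order.

```c
#include <stddef.h>

/* Hypothetical model of one send on a handler path. */
enum send_kind { SEND_DATA, SEND_CONTROL };

/*
 * Index of the last data send on the path, i.e. the last point at which
 * the buffer is still needed. Returns -1 if the path has no data send.
 */
static int last_data_send(const enum send_kind *path, size_t n)
{
    int last = -1;
    for (size_t i = 0; i < n; i++)
        if (path[i] == SEND_DATA)
            last = (int)i;
    return last;
}

/*
 * Length of the trailing run ("suffix") of control sends. A non-zero
 * result on a path that frees its buffer only at the final send marks a
 * candidate for freeing the buffer earlier.
 */
static size_t control_suffix_len(const enum send_kind *path, size_t n)
{
    int last = last_data_send(path, n);
    return n - (size_t)(last + 1);
}
```

For a path consisting of one data send followed by two control sends, `control_suffix_len` returns 2, so the free could move to the data send.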

Redundant length assignments. Our second, 
lower-level optimization extension detects redundant 
assignments to a message buffer’s length field. For 
speed, when sending multiple messages, implemen- 
tors set a buffer’s message length early in a handler 
and then try to reuse this setting across multiple mes- 
sages. Long path lengths make it easy to miss redun- 
dant assignments. Our checker detects redundancies 
by recording the last assignment on every path and
warning if there are two assignments of the same con- 
stant. It discovered 40 redundant assignments in the 
FLASH protocol code. 
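A minimal sketch of the per-path check, written in C rather than metal, may make the rule concrete. The `event` structure and `count_redundant` are illustrative assumptions; the real checker works on the compiler's representation of the handler code.

```c
#include <stddef.h>

/* Hypothetical event on a path: a length assignment or anything else. */
struct event {
    int is_len_assign;  /* non-zero if this event assigns the length field */
    int value;          /* assigned constant, used only if is_len_assign */
};

/*
 * Walk one path, remembering the last constant assigned to the length
 * field, and count assignments that store that same constant again.
 */
static size_t count_redundant(const struct event *path, size_t n)
{
    int have_last = 0;
    int last = 0;
    size_t redundant = 0;

    for (size_t i = 0; i < n; i++) {
        if (!path[i].is_len_assign)
            continue;
        if (have_last && path[i].value == last)
            redundant++;
        have_last = 1;
        last = path[i].value;
    }
    return redundant;
}
```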

Efficient opcode setting. Message headers must 
specify the message’s opcode (type). Opcode assign- 
ment costs two instructions. However, if the handler 
knows what opcode is currently in a header, it can 
change the opcode in one instruction by xoring the 
message header with the xor of the new and current 
opcode. Our extension detects such cases by com- 
puting when a message header, with known opcode, 
is assigned a new opcode. Both the old and new op- 
codes must be the same on all incoming paths. The 
extension determines the initial header value by look- 
ing in an automatically-built list of all opcodes a han- 
dler might receive. If there is only one possible op- 
code value, the extension records it and starts in a 
“known” state. Otherwise, the checker starts in an 
“unknown” state. It transitions from this state to 
the “known” state after the first opcode assignment. 
Each assignment encountered in the known state is 
annotated with the current opcode value. A second 
pass then checks every assignment and, if all paths 
reached it in the known state with the same opcode, 
emits a warning to the user that xor could be used to 
save an instruction. This checker found hundreds of 
such cases. 
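The arithmetic behind this optimization can be checked with a short C sketch. The header layout (opcode in the low eight bits) and the function names are assumptions made for illustration only; FLASH's actual message header format is not given here.

```c
#include <stdint.h>

/* Assumed layout: the opcode occupies the low 8 bits of the header. */
#define OPCODE_MASK 0xFFu

/* Plain assignment: clear the opcode field, then OR in the new opcode. */
static uint32_t set_opcode_plain(uint32_t header, uint8_t new_op)
{
    return (header & ~OPCODE_MASK) | new_op;
}

/*
 * XOR form: valid only when the header currently holds old_op. When both
 * opcodes are known constants, old_op ^ new_op folds to a constant and a
 * single xor remains.
 */
static uint32_t set_opcode_xor(uint32_t header, uint8_t old_op, uint8_t new_op)
{
    return header ^ (uint32_t)(old_op ^ new_op);
}
```

Both functions give the same result whenever the header really contains `old_op`, which is why the checker requires the old opcode to be the same on all incoming paths.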


9 Conclusion 


Systems are pervaded with restrictions of what ac- 
tions programmers must always or never perform, 
how they must order events, and which actions are 
legal in a given context. In many cases, these re- 
strictions link together the entire system, creating a 
fragile, intricate mess. Currently, systems builders 
obey these restrictions as well as they can. Unfortu- 
nately, system complexity makes such obedience dif- 
ficult to sustain. Programmers make mistakes, and 
often they have only an approximate understanding 
of important system restrictions. Such mistakes can 
easily evade testing, which rarely exercises all cases. 

We have shown that many system restrictions can 
be automatically checked and exploited using meta- 
level compilation (MC). MC makes it easy for imple- 
mentors to extend compilers with lightweight system- 
specific checkers and optimizers. Currently, a system 
rule must be understood by all implementors. MC 
allows one implementor, who understands this rule, 
to write a check that is enforced on everyone’s code. 
This leverage exerts tremendous practical force on 
the development of complex systems. 


USENIX Association 




[Table 6 rows: Side-effects (§ 4.1), Static assert (§ 4.2), Stack check (§ 4.3), User-ptr (§ 5.1), Allocation (§ 5.2), Block (§ 6.2), Module (§ 6.3), Mutex (§ 7), Total. Columns: Error, False Positives, Uses, LOC.]

Table 6: The results of MC-based checkers summarized over all checks. Error is the number of errors found,
False Positives is the number of false positives, Uses is the number of times the check was applied, and
LOC is the number of lines of metal code for the extension (including comments and whitespace).


MC is a general approach, scaling from simple 
cases such as checking assertions up to global strate- 
gies for mutual exclusion and deadlock avoidance. We 
have demonstrated MC’s power by using it to check 
four real, heavily-used, and tested systems. It found 
bugs in all of them — roughly 500 in all — many 
of which would be difficult to find with testing or 
manual inspection. Further, these extensions typi- 
cally required less than a day and a hundred lines 
of code to implement. Curiously, writing code to 
check restrictions is significantly easier than writing 
code that obeys them. With few exceptions, our ex- 
tensions were written by programmers who, at best, 
only had a passing familiarity with the systems to 
which they were applied. We believe that these re- 
sults show that the use of meta-level compilation can 
significantly aid system construction. 
Abstract


To keep up with the frantic pace at which de- 
vices come out, drivers need to be quickly devel- 
oped, debugged and tested. Although a driver is 
a critical system component, the driver develop- 
ment process has made little (if any) progress. 
The situation is particularly disastrous when 
considering the hardware operating code (i.e., 
the layer interacting with the device). Writ- 
ing this code often relies on inaccurate or in- 
complete device documentation and involves 
assembly-level operations. As a result, hard- 
ware operating code is tedious to write, prone 
to errors, and hard to debug and maintain. 


This paper presents a new approach to de- 
veloping hardware operating code based on an 
Interface Definition Language (IDL) for hard- 
ware functionalities, named Devil. This IDL 
allows a high-level definition of the communi- 
cation with a device. A compiler automatically 
checks the consistency of a Devil definition and 
generates efficient low-level code. 


Because the Devil compiler checks safety crit- 
ical properties, the long-awaited notion of ro- 
bustness for hardware operating code is made 
possible. Finally, the wide variety of devices 
that we have already specified (mouse, sound,
DMA, interrupt, Ethernet, video, and IDE disk 
controllers) demonstrates the expressiveness of 
the Devil language. 


* Author's current address: LaBRI / ENSERB, 351
cours de la Libération, F-33405 Talence Cedex, France.

† Author's current address: Trusted Logic, 5 rue
du Bailliage, F-78000 Versailles, France. E-mail:
Renaud.Marlet@trusted-logic.fr.


1 Introduction 


A device driver is a key system component 
that makes hardware innovation available to 
end users. Device drivers are critical both 
in general-purpose computers and in the fast- 
evolving domain of appliances. If driver devel- 
opment falls behind, product competitiveness 
can be compromised. If a device driver is faulty, 
a hardware innovation may turn into a disaster 
instead of improving competitiveness. 


Still, ever since the first device drivers have 
been written, their development process has 
made little (if any) progress. This situation has 
particularly disastrous effects when considering 
hardware operating code (i.e., code communi- 
cating with the hardware). This layer of code 
is well-known to be low level and error prone. 


Hardware operating code is low level because 
it consists of many bit operations. Indeed, we 
have found that bit operations can represent 
up to 30% of driver code¹. Such low-level programming
is obviously prone to errors and requires
tedious debugging. In fact, advances in
programming languages have had no impact on 
the development of hardware operating code: 
there is no syntactic support for low-level oper- 
ations, there is no verification support to iden- 
tify incorrect usage of these operations, and 
there is no tool support to facilitate debugging. 


Additionally, hardware documentation typically
contains imprecise or inaccurate information.
Therefore, writing hardware operating
code typically involves laboriously searching for
obscure incantations aimed at performing specific
operations on the device. Not only can
this sometimes cause unexpected behavior, but
it also makes re-use of hardware operating code
difficult.

¹ This measurement was performed on various
Linux 2.2-12 drivers.

4th Symposium on Operating Systems Design and Implementation


Finally, there are no recognized methodolo- 
gies for structuring device drivers. Even worse, 
a driver is often written by modifying an exist- 
ing one. As a result, the code quickly becomes 
tangled, which makes debugging and mainte- 
nance complex. 


Our proposal 


This paper describes a new approach to de- 
veloping the hardware operating layer of a 
driver. Our approach allows drivers to be writ- 
ten in a high-level language, allows important 
safety properties to be checked, and allows low- 
level code to be automatically generated. 


We introduce an Interface Definition Lan- 
guage (IDL) to describe hardware function- 
alities, named Devil. IDLs are extensively 
used in modern OSes, either to hide heterogeneity
and intricacies of message construction
in distributed systems [3, 13], or to glue together
components in modular operating systems
[2, 9, 10]. Just as RPC IDLs conventionally
define operations and their input/output
types, Devil specifies the functional interface
of the device. To do so, it provides the pro- 
grammer with abstractions and syntactic con- 
structs that are specific to describing devices. 
From a Devil specification, a compiler automat- 
ically generates stubs containing low-level code 
to operate the device. Furthermore, verifica- 
tion tools enable critical safety properties to 
be checked at compile time, and at run time if 
necessary. 


Just as an IDL typically allows code to be 
re-used, a Devil specification can be re-used in 
different contexts (e.g., various operating sys- 
tems). More generally, our vision is that Devil 
specifications either should be written by de- 
vice vendors or should be widely available as 
public domain libraries in order to ease driver 
development. 


Our contributions are as follows. 


• We have designed and implemented an
IDL for devices. This language is an alter- 
native to assembly-language-like program- 
ming of devices. 


• We propose tools to verify critical safety
properties of hardware operating code. 
These tools enable us to provide the long- 
awaited notion of robustness for device 
drivers. 


• We present a comparison between Devil
specifications and existing driver code. 
This comparison is based on experimen- 
tal data which demonstrate that a Devil 
specification is up to 5.9 times less prone 
to errors than C code, with almost no loss 
in performance. 


The rest of this paper is organized as fol- 
lows. Section 2 presents the Devil language. 
Section 3 describes the safety properties that 
can be verified both statically on Devil specifi- 
cations and dynamically by the generated inter- 
face. Section 4 assesses the benefits of our ap- 
proach by comparing hand-crafted drivers with 
equivalent ones written using Devil. Section 5 
describes related work. Section 6 concludes and 
suggests future work. 


2 Devil 


Devil is an IDL for specifying the functional 
interface of a device. To design Devil, we
have studied a wide spectrum of devices and 
their corresponding drivers, mainly from Linux 
sources: Ethernet, video, sound, disk, inter- 
rupt, DMA and mouse controllers. This study 
was supported by literature about driver devel- 
opment [7, 16], device documentation available 
on the web, and discussions with device driver 
experts for Windows, Linux and embedded op- 
erating systems. Devil has proved expressive 
enough to describe even devices having a con- 
torted interface such as the Crystal CS4236B 
sound controller. 


Concretely, a device can be described by
three layers of abstraction: ports, registers, and
device variables. The entry point of a Devil
specification is the declaration of a device,
parameterized by ports or ranges of ports, which
abstract physical addresses. Ports then allow
device registers to be declared; these define the
granularity of interactions with the device.
Finally, device variables are defined from registers,
forming the functional interface to the device.


device logitech_busmouse (base : bit[8] port @ {0..3})
{
  // Signature register (SR)
  register sig_reg = base @ 1 : bit[8];
  variable signature = sig_reg, volatile, write trigger : int(8);

  // Configuration register (CR)
  register cr = write base @ 3, mask '1001000.' : bit[8];
  variable config = cr[0] : { CONFIGURATION => '1', DEFAULT_MODE => '0' };

  // Interrupt register
  register interrupt_reg = write base @ 2, mask '000.0000' : bit[8];
  variable interrupt = interrupt_reg[4] : { ENABLE => '0', DISABLE => '1' };

  // Index register
  register index_reg = write base @ 2, mask '1..00000' : bit[8];
  private variable index = index_reg[6..5] : int(2);

  register x_low  = read base @ 0, pre {index = 0}, mask '****....' : bit[8];
  register x_high = read base @ 0, pre {index = 1}, mask '****....' : bit[8];
  register y_low  = read base @ 0, pre {index = 2}, mask '****....' : bit[8];
  register y_high = read base @ 0, pre {index = 3}, mask '...*....' : bit[8];

  structure mouse_state = {
    variable dx = x_high[3..0] # x_low[3..0], volatile : signed int(8);
    variable dy = y_high[3..0] # y_low[3..0], volatile : signed int(8);
    variable buttons = y_high[7..5], volatile : int(3);
  };
}

Figure 1: Logitech Busmouse Specification


These three layers of abstraction are illus- 
trated by the following fragment of the Devil 
description of the Logitech Busmouse con- 
troller (see Figure 1 for a complete description). 


device logitech_busmouse(base : bit[8] port@{0..3})
{
  register sig_reg = base @ 1 : bit[8];
  variable signature = sig_reg, ... : int(8);
  ...
}


The logitech_busmouse declaration is param- 
eterized by a range of ports specified as the 
main address base and a range of offsets (from 
0 to 3). An eight-bit register sig_reg is de- 
clared at port base, offset by 1. Finally, the 
device variable signature is the interpretation 
of this register as an eight-bit unsigned inte- 
ger. This fragment declares a device whose 
functional interface consists of a device variable 
(signature). Only device variables are visible
from outside a Devil description; ports and registers
are hidden. In fact, for each variable the
Devil compiler generates two C stubs that
write or read the variable by emitting
the proper I/O operations.


In the rest of this section, we first describe 
the basic Devil constructs, and then present ad- 
vanced Devil features that allow the description 
of devices with contorted addressing modes. 


2.1 Basic Devil 


Ports, registers, and device variables are the 
basic layers of abstraction that describe the in- 
terface of a device. We now present their usage 
by describing in detail the Devil specification 
of the Logitech Busmouse (see Figure 1), and 
a fragment of the NE2000 Ethernet controller. 


Ports. The port abstraction is at the basis 
of the communication with the device. A port 
hides the fact that, depending on how the de- 
vice is mapped, it can be operated via either 
I/O or memory operations. A device often has 
several communication points whose addresses 
are derived from one or more base addresses. 
Therefore, the port constructor, denoted by @, 
takes as arguments a ranged port and a constant
offset (e.g., base@1 as illustrated by line 4
of the Busmouse specification). To enable verification,
the range of valid offsets must be specified
within the entry point declaration (e.g.,
port@{0..3} as illustrated by line 1 of the Busmouse
specification).


Registers. Registers define the granularity 
of interaction with a device; as such, register
size (in number of bits) must be explicitly specified.
Registers are typically defined using two
ports: one for reading and one for writing. 
Only one port needs to be provided when read- 
ing and writing share the same port, or when 
the register is read-only or write-only. 


A register declaration may be associated
with a mask to specify bit constraints. An element
of this mask can either be '.' to denote
a relevant bit, '0' or '1' to denote a bit that is
irrelevant when read but has a fixed value (0 or
1) when written, or '*' to denote a bit that is
irrelevant whether read or written. As an example,
consider the declaration of the write-only
register index_reg in line 16 of the Busmouse
specification.


register index_reg =
    write base@2, mask '1..00000' : bit[8];


This mask indicates that only bits 6 and 5 
are relevant. Also, bit 7 is forced to 1 when 
written while bits 4 through 0 are forced to 0. 
Proper register masking is performed as part of 
the stubs generated by the Devil compiler. 
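As a rough illustration of the masking a generated stub has to perform for index_reg, the following C sketch hard-codes the bit positions described above. The macro and function names are hypothetical; the code actually emitted by the Devil compiler is not shown in this section.

```c
#include <stdint.h>

/* Derived from mask '1..00000' as described in the text. */
#define INDEX_REG_RELEVANT 0x60u  /* bits 6 and 5 carry data */
#define INDEX_REG_FORCED_1 0x80u  /* bit 7 is forced to 1 on write */
/* Bits 4 through 0 are forced to 0, so they are simply masked away. */

/* Compute the byte that would be written to the port for a raw value. */
static uint8_t index_reg_masked(uint8_t raw)
{
    return (uint8_t)((raw & INDEX_REG_RELEVANT) | INDEX_REG_FORCED_1);
}
```

Whatever the caller passes, the written byte always has bit 7 set and bits 4 through 0 clear.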


Device variables. In order to minimize the 
number of I/O operations required for com- 
municating with a device, hardware designers 
often group several independent values into a 
single register. Accessing these values requires 
bit mask and shift operations which are error- 
prone in a general programming language such 
as C. Devil abstracts values as device variables, 
which are defined as a sequence of bit regis- 
ters. Device variables are strongly typed in or- 
der to detect potential misuses of the device. 
Possible types are booleans, enumerated types, 
signed or unsigned integers of various sizes, and 
ranges or sets of integers. In line 17 of the Busmouse
specification, bits 5 and 6 of the
index_reg register make up a two-bit unsigned
integer variable (i.e., a variable that can take
a value from 0 to 3). The private attribute
means that the index variable is not defined in
the functional interface of the Busmouse controller
and cannot be directly accessed by the
driver programmer.


private variable index = index_reg[6..5] : int(2);

Access pre-actions. Device functionalities
are often extended by mapping multiple reg- 
isters to a single physical address. Examples 
are index-based addressing mode and banks of 
registers. As a result, accessing such registers 
requires the setting of a specific context which 
may involve several I/O operations. To cap- 
ture this situation, Devil allows pre-actions to 
be attached to a register. Lines 19 and 20 of 
the Busmouse specification declare two read- 
only registers on the same port base@0, pro- 
vided that the variable index is set either to 0 
or 1 prior to the port access. 


register x_low = read base@0, mask '****....',
    pre {index = 0} : bit[8];
register x_high = read base@0, mask '****....',
    pre {index = 1} : bit[8];
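To make the effect of a pre-action concrete, here is a small C sketch in which the device is simulated by two helper functions. Everything here is an assumption for illustration: `port_write`, `port_read`, the simulated data, and the stub names do not come from the Devil compiler.

```c
#include <stdint.h>

/* Simulated device state standing in for real port I/O (assumption). */
static uint8_t sim_index;                         /* selected register */
static const uint8_t sim_data[4] = { 0x0A, 0x0B, 0x0C, 0x0D };

static void port_write(unsigned offset, uint8_t v)
{
    /* At base@2, bit 7 set selects index_reg; index is in bits 6..5. */
    if (offset == 2 && (v & 0x80u))
        sim_index = (uint8_t)((v >> 5) & 0x3u);
}

static uint8_t port_read(unsigned offset)
{
    return offset == 0 ? sim_data[sim_index] : 0;
}

/* Pre-action: set the private variable index through index_reg. */
static void set_index(uint8_t index)
{
    port_write(2, (uint8_t)(0x80u | ((index & 0x3u) << 5)));
}

/* Each read stub first establishes its context, then reads base@0. */
static uint8_t read_x_low(void)  { set_index(0); return port_read(0); }
static uint8_t read_x_high(void) { set_index(1); return port_read(0); }
```

The two stubs read the same port yet return different registers, because each one runs its own pre-action first.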


Register concatenation. Device variables 
can be spread over several registers. As illus- 
trated by line 25 of the Busmouse specification, 
constructing the dx variable requires concate- 
nation of the two registers x_high and x_low. 
The 8-bit variable dx is obtained by concatenating
the four lower bits of register x_high with
the four lower bits of register x_low.


variable dx = x_high[3..0] # x_low[3..0], ...
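In C, this concatenation corresponds to the kind of mask-and-shift code that a driver would otherwise contain by hand. The function below is a hypothetical sketch of that computation, not the generated stub.

```c
#include <stdint.h>

/* Build the signed 8-bit dx from the low nibbles of x_high and x_low. */
static int8_t make_dx(uint8_t x_high, uint8_t x_low)
{
    unsigned raw = ((x_high & 0x0Fu) << 4) | (x_low & 0x0Fu);

    /* Interpret the 8-bit pattern as two's complement without relying
       on implementation-defined narrowing conversions. */
    return raw < 128u ? (int8_t)raw : (int8_t)((int)raw - 256);
}
```

The upper nibbles of both registers are ignored, which matches the bit ranges `[3..0]` in the declaration.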


Enumerated types. Devil allows defining 
an enumerated type to abstract the concrete 
representation of bit values. The symbols <=, 
=> and <=> define read, write and read-write 
constraints, respectively. Enumerated types 
are used to specify the valid values of a de- 
vice variable. As an example, the config vari- 
able declaration shown in line 9 of the Bus- 
mouse specification declares the two modes 
(CONFIGURATION and DEFAULT_MODE) that can 
be written to the config variable. 


variable config = cr[0] : { 
CONFIGURATION => ’1’, DEFAULT_MODE => ’0’ }; 


Caching and synchronization. Sharing 
one or more registers between variables induces 
cache and synchronization problems. When 
one variable needs to be written independently 
from the others, the Devil compiler has to determine
a value to assign to the other variables.
The choice of value depends on whether
the access to that variable is idempotent. A 
Devil variable can be associated with a behav- 
ior qualifier that specifies the access semantics. 
No qualifier (the default case) means that the 
access is idempotent and thus can be redone 
without side effect; consequently, the variable 
value can be cached. Such a behavior is often 
associated with variables that serve as param- 
eters. 


A trigger behavior means that a write (or 
read) access to the variable induces a side effect 
on the controller. Since the side effect cannot 
be re-done, multiple trigger variables cannot be 
defined on a register unless a neutral value is 
provided. Command variables usually have a 
trigger behavior. The following fragment from 
an NE2000 Ethernet controller presents exam- 
ples of the trigger behavior. 


register cmd = base@0 : bit[8];
variable st = cmd[1..0],
    write trigger except NEUTRAL;
variable txp = cmd[2],
    write trigger except NOP;
variable rd = cmd[5..3],
    write trigger except NODMA;

private variable page = cmd[7..6] : int(2);


In this example, the register cmd is split into 
four variables. While the page variable has an 
idempotent behavior, the variables st, txp and 
rd trigger an action when written, except for 
specific values (NEUTRAL, NOP and NODMA).²


Finally, a volatile behavior specifies that a 
read operation is not idempotent; two succes- 
sive reads may deliver different values. When 
one needs to get a consistent value of several 
volatile variables, it is necessary to read them 
together in one or multiple read operations and 
cache the result for later use. To do so, Devil 
allows several variables to be grouped using a 
structure. The use of a structure is demon- 
strated by the dx, dy and buttons variables 
of the Busmouse specification (lines 24 to 28).


structure mouse_state = {
  variable dx =
    x_high[3..0] # x_low[3..0], volatile : ...
  variable dy =
    y_high[3..0] # y_low[3..0], volatile : ...
  variable buttons = y_high[7..5], volatile : ...
};


² These values are defined using an enumerated type,
not shown here.


To access field variables dy and buttons, the 
programmer first has to read the mouse_state 
structure. Stubs generated for the structure 
perform the effective I/O operations, while 
stubs for the field variables access only the 
cache. It should be noted that since dy and 
buttons share the y_high register, y_high is 
read only once. Use of the stubs by the driver 
programmer is detailed in section 4.1. 
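The split between structure stubs and field stubs can be sketched in C as follows. The cache layout, the simulated register contents, and all function names are hypothetical; they only illustrate that I/O happens once per register when the structure is read, and that field accessors touch the cache alone.

```c
#include <stdint.h>

/* Cached copy of the four registers behind mouse_state (assumption). */
struct mouse_state_cache { uint8_t x_low, x_high, y_low, y_high; };

static unsigned io_reads;   /* counts simulated port reads */

/* Stand-in for a real indexed port read. */
static uint8_t sim_read_reg(unsigned index)
{
    static const uint8_t regs[4] = { 0x01, 0x00, 0x0F, 0xAF };
    io_reads++;
    return regs[index & 3u];
}

/* Structure stub: performs the I/O, reading each register once. */
static void get_mouse_state(struct mouse_state_cache *c)
{
    c->x_low  = sim_read_reg(0);
    c->x_high = sim_read_reg(1);
    c->y_low  = sim_read_reg(2);
    c->y_high = sim_read_reg(3);
}

/* Field stubs: no I/O, they only decode the cached registers. */
static int8_t get_dy(const struct mouse_state_cache *c)
{
    unsigned raw = ((c->y_high & 0x0Fu) << 4) | (c->y_low & 0x0Fu);
    return raw < 128u ? (int8_t)raw : (int8_t)((int)raw - 256);
}

static uint8_t get_buttons(const struct mouse_state_cache *c)
{
    return (uint8_t)((c->y_high >> 5) & 0x7u);
}
```

Reading dy and buttons after one call to `get_mouse_state` leaves the read counter unchanged, so both fields are guaranteed to come from the same sample of y_high.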


Cache and synchronization issues are usu- 
ally only informally documented by hardware 
vendors. When programming controllers in 
a general programming language, cache and 
synchronization issues are typically solved in 
an ad-hoc manner that limits code re-use and 
driver evolution. In fact, the lack of a rigorous 
description of variable behaviors often leads to 
laborious testing until the expected functional- 
ity is obtained. Also, without specific language 
support, no verification of the correct usage of 
variables is possible; this opens opportunities 
for undetected errors. 


Assessment. By clearly defining the seman- 
tics of variable behavior, a Devil specification 
serves as a knowledge repository for the correct
use of a device. In fact, the driver programmer 
is guided by the interface generated from the 
Devil specification. This simplifies driver de- 
velopment and improves re-use. Furthermore, 
verification is possible at two design stages: (i) 
on the Devil specification itself so as to check 
consistency of declarations, (ii) on the correct 
usage of interface procedures generated by the 
Devil compiler. These advantages are even 
more crucial when the device interface is awkward
and contorted. The next section presents
advanced Devil constructs that
handle these situations.


2.2 Advanced Devil 


To maximize performance, most modern de- 
vices offer a simple, flat interface to registers. 
However, devices are rarely built from scratch 
and many of them are evolutions or supersets of 
previous controllers. For example, today’s PCs 
still rely on DMA, interrupt and graphics con- 
trollers that were designed more than twenty 
years ago. 

Design constraints of older devices were
guided not only by performance but also by 
technology and the size of the available I/O 
address space. Adding functionalities to a de- 
vice while maintaining backward compatibil- 
ity induces tricks for addressing additional reg- 
isters. These issues result in contorted ad- 
dressing modes, making the programming of 
such devices even more complex and error- 
prone. Devil has been specifically targeted to- 
wards supporting such devices. Let us now 
present some of the advanced Devil features 
using fragments from the Devil specifications 
of the 8237A DMA, the 8259A interrupt, the 
Crystal CS4236B, and the IDE controllers. 


Register serialization. The 8237A DMA 
controller provides 16-bit counters through a 
single 8-bit port. As illustrated by the following 
example, constructing the counter x requires 
concatenation of the two registers cnt_high 
and cnt_low. Since these registers are accessed 
through the same port, a reading order has to 
be specified (cnt_low then cnt_high). Finally, 
a pre-action attached to cnt_low (write any
value to the flip_flop variable) resets
an internal pointer to this register.


register cnt_low = 
data, pre {flip_flop = *} : bit[8]; 
register cnt_high = data : bit[8]; 
variable x = cnt_high # cnt_low : int(16) 
serialized as {cnt_low; cnt_high}; 
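The following C sketch simulates why the order and the pre-action matter. The simulated flip-flop, the counter value, and the function names are assumptions for illustration; real 8237A access would go through port I/O instead.

```c
#include <stdint.h>

/* Simulated controller state (assumption, not real hardware access). */
static uint16_t sim_counter = 0x1234;
static int sim_next_is_high;     /* internal pointer: 0 = low byte next */

/* Writing any value to flip_flop resets the internal pointer. */
static void write_flip_flop(uint8_t any)
{
    (void)any;
    sim_next_is_high = 0;
}

/* Each read of the data port returns one byte and toggles the pointer. */
static uint8_t read_data(void)
{
    uint8_t v = sim_next_is_high ? (uint8_t)(sim_counter >> 8)
                                 : (uint8_t)(sim_counter & 0xFFu);
    sim_next_is_high = !sim_next_is_high;
    return v;
}

/* Stub for x: pre-action first, then cnt_low, then cnt_high. */
static uint16_t get_x(void)
{
    write_flip_flop(0);
    uint8_t low = read_data();
    uint8_t high = read_data();
    return (uint16_t)(((unsigned)high << 8) | low);
}
```

Even if an earlier stray read has left the internal pointer at the high byte, `get_x` returns the correct counter because the pre-action resynchronizes it.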


Control-flow based serialization. The 
8259A interrupt controller possesses various ex- 
ecution modes that depend on the hardware 
configuration (processor type, cascaded/single 
controller) [12]. Initialization of the controller 
is performed by writing to configuration vari- 
ables defined over four initialization registers. 
The initialization sequence varies with the ac- 
tual values of configuration variables. Addi- 
tionally, three of the configuration registers 
(i.e., icw2, icw3, icw4) are mapped to a
single port and their addressing is implicitly 
done by previously written configuration val- 
ues. The following example shows how such 
an addressing mode can be specified in Devil: 
configuration variables are grouped together 
within the init structure. Writing variables 
of this structure into registers is ordered using 
tests on variable values. 
register icw1 =
    write base@0, mask '...1....' : bit[8];
register icw2 = write base@1 : bit[8];
register icw3 = write base@1 : bit[8];
register icw4 =
    write base@1, mask '000.....' : bit[8];

structure init = {
  variable sngl = icw1[1] : {
    SINGLE => '1', CASCADED => '0' };
  variable ic4 = icw1[0] : bool;

  variable microprocessor = icw4[0] : {
    X8086 => '1', MCS80_85 => '0' };
} serialized as {
  icw1;
  icw2;
  if (sngl == SINGLE) icw3;
  if (ic4 == true) icw4;
};

Automata based addressing mode. 


Among the chips we have studied, the Crys- 
tal CS4236B sound chip is one of the most 
complex. This chip is compatible with the 
Windows Sound System standard [5], but pos- 
sesses 18 additional registers. These registers 
are doubly indexed through the I23 index.
Writing a specific device variable converts 
I23 from an extended address register into an 
extended data register. To convert I23 back to 
an address register, the control register must 
be written. In order to specify this automaton,
Devil offers the notion of private variables that 
are not mapped to a specific register (xm in 
the following example). These variables can be 
used as memory cells and can be updated when 
writing a register or a device variable. The 
code below shows how the extended registers 
of the CS4236B can be specified using Devil. 


private variable xm : bool;
register control =
    base@0, set {xm = false} : bit[8];
variable IA = control : int{0..31};

// Indexed Registers I0 - I31
register I(i : int{0..31}) =
    base@1, pre {IA = i} : bit[8];
register I23 = I(23), mask '......0.';
variable ACF = I23[0] : bool;
structure XS = {
  variable XA = I23[2,7..4] : int(5);
  variable XRAE = I23[3], set {xm = XRAE},
    write trigger for true : bool;
};

// Extended Registers X0-X17,X25
register X(j : int{0..17,25}) = base@1,
    pre {XS = {XA=>j; XRAE=>true}} : bit[8];


Block transfer. On some processors, such 
as those of the Pentium family, replacing a 
C loop over a variable read/write by a dedicated
looping instruction (e.g., rep on the Pentium)
is often more efficient. Variables with a
block transfer usage have to be identified with a 
block keyword. For those variables, the Devil 
compiler generates two processor-specific block 
transfer stubs in addition to the single access 
stubs. The Ide_data variable declaration from 
the IDE specification shown below illustrates 
the use of the block attribute. 


variable Ide_data = 


ide_data, trigger, volatile, block : int(16); 


Other features of Devil are not detailed here. 
These features include access post-actions, ar- 
rays, register constructors and conditional dec- 
larations depending on device modes. A com- 
plete description of Devil can be found in [17]. 


3 Property Verification 


Devil has been designed to express domain- 
specific information about the functional inter- 
face of devices. Because this information is 
made explicit, Devil enables a variety of ver- 
ifications that are beyond the scope of general 
programming languages. As a result, more er- 
rors can be caught earlier in the driver develop- 
ment process. In turn, debugging is easier and 
less time-consuming. Finally, the robustness of 
the driver is improved since the programmer 
has guarantees over the correctness of low-level 
interactions. 


This section summarizes the properties that 
can be verified both when a Devil description 
is compiled and when the resulting interface 
implementation is used. 


3.1 Verification of Devil specifica- 
tions 


Due to the declarative nature of the Devil 
language, it is possible to verify the follow- 
ing properties that ensure the consistency of 
a specification: 


Strong typing. Devil abstractions (e.g.,
ports, registers, variables) are strongly typed:
all uses of these abstractions can be matched
against their definition to check type correctness.
Types describe usage constraints for registers
and variables that are read or write only.
Also, various size checks can be performed: the
size of data accesses on ports, the size of registers,
the size of variables derived from conversion
functions, the size of bit masks, and the
size of bit patterns that are associated with a
symbolic name in enumerated types, port ranges,
and bit ranges for register fragments.


No omission. All declared entities in a Devil 
specification must be used at least once. This 
constraint concerns port arguments in a device 
declaration, values of ranged port offsets, regis- 
ters, and register bits (although some bits can 
be declared irrelevant using bit masks). Read 
elements of a type mapping must be exhaus- 
tive. Also, a type for reading (as well as possi- 
bly writing) must be used with a readable vari- 
able. The same holds for writing. 


No double definition. All entities in a 
Devil specification must be declared at most 
once. This constraint concerns port arguments 
in a device declaration, ports, registers, types, 
symbolic names and bit patterns in enumerated 
types and variables. 


No overlapping definitions. Port and reg- 
ister descriptions must not overlap. More pre- 
cisely, each port must appear only once in 
the register definitions, except when registers 
are defined using disjoint pre-actions or masks. 
However, the same port may be used for read- 
ing from one register and writing to another. 
No bit of a single register can be used in the 
definition of two different variables. 
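
The last constraint amounts to a mask-disjointness test. The following is a minimal sketch in C, assuming that each variable's claim on a register can be summarized as a bit mask; the structure and function names are hypothetical and are not taken from the Devil compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of one variable definition: the register it
 * draws from and the bits of that register it uses. */
struct var_def {
    int     reg_id;   /* index of the register (assumed numbering) */
    uint8_t mask;     /* bits of the register used by the variable */
};

/* Returns 1 if no bit of any register is claimed by two variables,
 * and 0 as soon as two definitions on the same register share a bit. */
static int no_bit_overlap(const struct var_def *vars, size_t n)
{
    assert(vars != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (vars[i].reg_id == vars[j].reg_id &&
                (vars[i].mask & vars[j].mask) != 0)
                return 0;
    return 1;
}
```

For instance, two variables using masks 0x0f and 0xe0 of the same register pass the test, whereas masks 0x0f and 0x18 share bit 3 and are rejected.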


3.2 Verification of interface usage 


Verification of the correct usage of the gener- 
ated interface can be both static and dynamic. 
In the latter case, run-time checks are option- 
ally included in the code for debugging pur- 
poses. 


When writing to a variable, a check can be
performed to verify that the written value falls
within the range specified by the variable type.
If the value is constant, the check can generally
be done at compile time. However, because the
type system of C is not powerful enough to express
all Devil types, not all such verifications
can be implemented at compile time. In this
situation, checks have to be implemented in debug
mode using run-time checks. Finally, run-time
checks can optionally be generated after
variable reads. Such checks are useful for verifying
that a device behaves according to its
Devil specification.
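
A debug-mode range check of this kind might look as follows. This is only a sketch: the switch name DEVIL_DEBUG appears in Figure 3-a, but its exact effect, the function name, and the reporting format are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>

#define DEVIL_DEBUG   /* switch name as in Figure 3-a; its effect here is assumed */

static unsigned range_violations = 0;   /* hypothetical violation counter */

/* Returns 1 if the value lies in [lo, hi]. In debug mode an out-of-range
 * value is reported and counted; without DEVIL_DEBUG the check is
 * compiled away and the function always returns 1. */
static int checked_value_ok(const char *var, unsigned value,
                            unsigned lo, unsigned hi)
{
    assert(lo <= hi);
#ifdef DEVIL_DEBUG
    if (value < lo || value > hi) {
        fprintf(stderr, "devil: %s = %u outside [%u, %u]\n",
                var, value, lo, hi);
        range_violations++;
        return 0;
    }
#else
    (void)var;
    (void)value;
#endif
    return 1;
}
```

The same routine can be called after a variable read to confirm that the device returned a value permitted by its specification.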


Our experience in re-engineering drivers 
showed that dynamic checks allow the early de- 
tection of usage errors, preventing them from 
becoming insidious bugs. This is particularly 
valuable for kernel-mode drivers, which are 
tricky to step through with a debugger. More- 
over, since the checks are automatically and 
systematically inserted and removed by the 
compiler, their use is easy and safe. 


4 Comparison with Hand-Crafted Drivers


To assess our approach, we now compare the 
use of Devil and C. First, we analyse issues re- 
lated to code development. Then, we report on 
a study based on mutation analysis to evaluate 
the robustness of Devil and C implementations. 
Finally, we discuss the performance of drivers 
that use the C library automatically generated 
from a Devil specification. 


4.1 Driver development 


To illustrate the benefits of Devil in terms 
of separation of concerns and readability, we 
compare a fragment of the original C imple- 
mentation of the Logitech Busmouse driver (see 
Figure 2) with the use of the interface (see 
Figure 3) generated from the equivalent Devil 
specification. 


In a traditional C driver, the program- 
mer writes code that accesses the device with 
assembly-language-level operations (e.g., bit 
manipulations). For example, the C code 
needed to express the concatenation of the four 
lower bits of registers y_high and y_low is tedious.
As shown in Figure 2-a, macros are often
defined so as to factor out common expressions
or associate names with commands. Nevertheless,
it is rather difficult to understand the behavior
of the device from the implementation,
and maintaining this code is difficult and
error-prone.
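
To make the bit manipulation concrete, the sketch below isolates the two expressions of Figure 2-b as plain C functions operating on already-read register bytes. The function names are hypothetical, and the port accesses themselves are left out so that the fragment can be compiled and tried on any machine.

```c
#include <assert.h>
#include <stdint.h>

/* Concatenate the four lower bits of y_high and y_low into one byte:
 * y_low supplies bits 0-3 of the result, y_high supplies bits 4-7. */
static uint8_t concat_dy(unsigned y_low, unsigned y_high)
{
    assert(y_low <= 0xffu && y_high <= 0xffu);   /* raw port bytes */
    return (uint8_t)((y_low & 0x0fu) | ((y_high & 0x0fu) << 4));
}

/* The button state occupies the three upper bits of y_high. */
static uint8_t extract_buttons(unsigned y_high)
{
    assert(y_high <= 0xffu);
    return (uint8_t)((y_high >> 5) & 0x07u);
}
```

With y_low = 0x03 and y_high = 0xA5, concat_dy yields 0x53 and extract_buttons yields 5.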


Using Devil, driver development is a two-stage
process: first the chip is specified in Devil,
then code is written using the stubs generated 
from the specification. Describing the device 
as opposed to coding improves readability. For 
instance, the Devil description of the variable 
dy in the Busmouse specification (see line 26 of 
Figure 1) consists of a straightforward concate- 
nation of two bit-fragments. The Devil specifi- 
cation is so close to a device description that it 
can be used for documentation purposes. 


When writing the driver code, the programmer
first has to include the Devil-generated stubs
and specify configuration information. For
instance, in Figure 3-a, Busmouse stubs are 
used in debug mode and in a single device 
configuration (#define DEVIL_NO_REF). Fur- 
ther communication with the device is encapsu- 
lated in stubs (see Figure 3-b). Therefore, the 
driver programmer only has to focus on oper- 
ating the device using abstract values. Writing 
the hardware operating code becomes a very 
simple task, especially if the programmer can 
use an existing Devil specification. 


4.2 Robustness 


As discussed in Section 3, Devil exposes 
properties that can be automatically checked. 
This section evaluates the benefits of these 
checks in terms of software robustness. 


Detecting bugs as early as possible is cru- 
cial during the development process. A study 
by DeMillo and Mathur found that simple er- 
rors (e.g., typographic errors, inattention er- 
rors) represent a significant fraction, though 
not the majority, of the errors in production 
programs. This study also revealed that such 
errors can remain hidden for a long time. Even 
though their study was concerned with the de- 
velopment of TeX, which differs from device 
drivers, these observations remain pertinent, 
and are even more important considering the 
permissive nature of a language such as C, especially
when used to write low-level code.


    #define MSE_DATA_PORT    0x23c
    #define MSE_CONTROL_PORT 0x23e
    #define MSE_READ_Y_LOW   0xc0
    #define MSE_READ_Y_HIGH  0xe0

2a. Macro definition

    dy = (inb(MSE_DATA_PORT) & 0xf);
    outb(MSE_READ_Y_HIGH, MSE_CONTROL_PORT);
    buttons = inb(MSE_DATA_PORT);
    dy |= (buttons & 0xf) << 4;
    buttons = ((buttons >> 5) & 0x07);

2b. Macro usage

Figure 2: Fragment of the original Linux driver for the Logitech Busmouse


    #define DEVIL_NO_REF
    #define dev_name bm
    #define DEVIL_DEBUG
    #include "busmouse.dil.h"

3a. Interface usage

    bm_get_mouse_state();
    dy = bm_get_dy();
    buttons = bm_get_buttons();

3b. Stub usage

    #define bm_get_mouse_state() ( \
      outb(1, bm_cache.__dil_base__+2); bm_cache.cache_mouse_state.cache_get_x_high = inb(bm_cache.__dil_base__); \
      outb(0, bm_cache.__dil_base__+2); bm_cache.cache_mouse_state.cache_get_x_low = inb(bm_cache.__dil_base__); \
      outb(3, bm_cache.__dil_base__+2); bm_cache.cache_mouse_state.cache_get_y_high = inb(bm_cache.__dil_base__); \
      outb(2, bm_cache.__dil_base__+2); bm_cache.cache_mouse_state.cache_get_y_low = inb(bm_cache.__dil_base__) )
    #define bm_get_dy() ( \
      (bm_cache.cache_mouse_state.cache_get_y_high & 0xfu) << 4 | bm_cache.cache_mouse_state.cache_get_y_low & 0xfu)
    #define bm_get_buttons() ((bm_cache.cache_mouse_state.cache_get_y_high & 0xe0u) >> 5)

3c. Generated stubs (after inlining)

Figure 3: Fragment of the Devil-based driver for the Logitech Busmouse


In order to evaluate the impact of Devil on
driver robustness, we have estimated the number
of errors that can be detected automatically
by the C and Devil compilers/checkers.³
The error-detection coverage is computed using
a mutation analysis technique [1, 8].


For a program P, mutation analysis produces 
a set of alternate programs, each generated by 
modifying a single statement of P, according to 
mutation rules. In our experiment, the muta- 
tion rules introduce errors in operators, iden- 
tifiers and literal constants. Such errors are 
generated by inserting, replacing or removing 
a character from the targeted token. For ex- 
ample, the logical operator || can be replaced 
by the bit operator |, the number 121 can be 
replaced by 21, etc. Mutation rules are defined 
so as to ensure that the resulting mutant is syntactically
correct, and actually modifies the semantics
of the program. Therefore, the compiler
detects a mutation-introduced error
only if the mutant violates a property
of the language (e.g., C or Devil).


³ In our current experiments, the benefits of run-time
checks in Devil-generated interfaces are not taken into
account.


In a C driver, we are only interested in test- 
ing the hardware operating code. Accordingly, 
we manually insert tags to mark the corre- 
sponding regions in the original C code, and 
only apply mutations to the tagged regions. In 
a Devil-based driver, mutations have to be applied
both to the Devil specification of the device,
and to procedure calls to the generated
interface (this C code is denoted by C_Devil in
the rest of the paper).


Our experiments compare the error-detection
coverage of C against the error-detection
coverages of the Devil specification
and C_Devil. It should be noted that our
measurements reflect the worst case for Devil
for the following reasons. First, the mutation
rules for C and Devil have been chosen so
that C is always favored. Second, since a
driver often uses a subset of a device, the
Devil specification offers more mutation sites
(possible errors) than the original C driver.
Finally, Devil specifications should ideally
come from the device manufacturer or widely
available public-domain libraries. Thus, one
can expect them to be bug-free and errors only
to appear in C_Devil.


Table 1: Language Error-Detection Coverage Analysis
[Columns: device; language; number of lines; mutation sites; mutants per site; undetected mutants per site; number of sites with undetected mutants; ratio to C. Row groups: Logitech Busmouse (C, Devil, C_Devil, Devil + C_Devil), IDE (Intel PIIX4), and Ethernet (NE2000). The cell values are not legible in this copy.]


Measurement analysis. Our study focuses
on three different devices (Logitech Busmouse,
NE2000 Ethernet, and IDE controllers)
and their corresponding Linux 2.2.12 drivers.
Table 1 presents the results of the mutation
analysis. Overall, the experiments show that
the probability of undetected errors is 1.6 to
5.2 times higher in hand-crafted C drivers than
in Devil-based drivers (Devil + C_Devil). When
comparing C to C_Devil only (assuming that the
specification is correct), the propensity for undetected
errors is 3.2 to 5.9 times higher in C.
Finally, it can also be observed that mutation
errors in Devil specifications are nearly always
detected.


The first column of Table 1 represents the
number of possible mutation sites ($s$). The second
column shows the number of mutants (i.e.,
errors) which can be injected for each site ($m_s$).
For example, given an integer of two digits in
base ten, 50 mutants can be generated (2 for
removing a digit, 30 for inserting a new digit,
and 18 for replacing a digit). The third column
shows, for each mutation site, the number of
mutants not detected by the compiler/checker
($um_s$).
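
The count of 50 can be reproduced with a few lines of C. The sketch below counts edit operations rather than distinct resulting strings, which is an assumption made here because it matches the 2 + 30 + 18 breakdown given above; the function name is hypothetical.

```c
#include <assert.h>

/* Number of single-character mutations of a decimal literal with
 * n_digits digits, counted as edit operations. */
static unsigned count_literal_mutants(unsigned n_digits)
{
    assert(n_digits > 0);
    unsigned removals     = n_digits;             /* drop any one digit          */
    unsigned insertions   = (n_digits + 1) * 10;  /* 10 digits at each gap       */
    unsigned replacements = n_digits * 9;         /* 9 other digits per position */
    return removals + insertions + replacements;
}
```

For a two-digit literal the three terms are 2, 30, and 18.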


To enable the comparison between C, Devil
and C_Devil, we are interested in measuring the
number of mutation sites that have undetected
mutants ($S_{um}$). To compute this value, we
have to weight the number of undetected mutants
per site by the number of mutation sites
($S_{um} = (um_s / m_s) \times s$). For example, consider the
Logitech Busmouse C driver. It has 62 mutation
sites. For each site, 36.6 mutants are generated
(on average) and 26.8 are not detected
by the compiler. This gives us 45.3 sites with
undetected mutants.
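
The weighting can be checked numerically. The following sketch simply evaluates the ratio of undetected mutants per site to mutants per site, scaled by the number of sites; the function name is hypothetical. With the rounded averages quoted above it evaluates to about 45.4, which agrees with the reported 45.3 up to rounding of the inputs.

```c
#include <assert.h>

/* Number of mutation sites with undetected mutants:
 * (undetected mutants per site / mutants per site) * number of sites. */
static double sites_with_undetected(double s, double m_s, double um_s)
{
    assert(m_s > 0.0);
    return um_s / m_s * s;
}
```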


4.3 Performance


It is well-recognized that the performance 
of drivers is critical for the overall system 
performance. Furthermore, as demonstrated 
by Thekkath and Levy for high-performance 
RPCs [18], the performance of the hardware 
operating code has a significant impact on the 
overall driver performance. While Devil can 
improve readability and robustness of driver 
hardware operating code, its usefulness de- 
pends on the efficiency of the generated code: 
using Devil must not induce significant execu- 
tion overhead. 


In order to evaluate the benefit and impact
of Devil on driver development, we are re-engineering
various Linux drivers and testing
them on a bi-processor PC.⁴ Among the drivers
and devices in a Unix system, we chose to implement
first the IDE and the accelerated X11
drivers for two reasons: (i) they are representative
of performance-intensive drivers and (ii) they
illustrate totally different device access behavior.


In the rest of this section, we first identify
the possible penalties induced by Devil, and
then we compare the performance of the IDE
and accelerated X11 Devil-based drivers with
the original ones.


⁴ The PC is a DELL Precision 210 with the following
configuration: two Pentium II 450 MHz, Intel PIIX4
PCI chipset, Maxtor model 91000D8 UDMA2 19.5Gb
disk with 512Kb cache, 3Dlabs Permedia2 graphic controller.


Table 2: IDE Linux driver comparative performance results (using C loops)
[Columns: sectors per interrupt; I/O size in bits; I/O operations and throughput in Mb/s for the standard driver; I/O operations and throughput in Mb/s for the Devil driver; throughput ratio. The cell values are not legible in this copy.]


Micro-analysis. Interface procedures generated
by the Devil compiler contain I/O as well
as bit-shift and bit-mask instructions. These 
procedures are optimized by the Devil compiler 
and implemented as pre-processor macros or in- 
lined functions. Therefore, there is no execu- 
tion overhead for a single Devil interface pro- 
cedure as compared to hand-crafted C instruc- 
tions. 


In one situation, we observed that Devil
could induce an execution penalty. Accessing
independent device variables (i.e., variables not
grouped in a structure) defined over a single
register requires multiple Devil interface calls.
Each additional call induces additional I/O, as
compared to a hand-crafted driver. Nevertheless,
as we found in our re-engineering of the
IDE and Permedia2 drivers, such variables are
often parameters and rarely affect the perfor- 
mance of the critical path. 
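
The penalty can be illustrated with a simulated register. In the sketch below the register, the read counter, and all function names are hypothetical stand-ins rather than generated Devil stubs: two independent accessors each trigger their own read, whereas a hand-crafted (or grouped) access reads the register once and extracts both fields.

```c
#include <assert.h>
#include <stdint.h>

static unsigned io_reads = 0;      /* counts simulated port reads       */
static uint8_t  fake_reg = 0xA5;   /* stand-in for one device register  */

static uint8_t read_reg(void) { io_reads++; return fake_reg; }

/* Independent variables: each accessor re-reads the register. */
static uint8_t get_low_field(void)  { return read_reg() & 0x0f; }
static uint8_t get_high_field(void) { return (read_reg() >> 5) & 0x07; }

/* Grouped access: one read, both fields extracted from the same value. */
struct fields { uint8_t low, high; };
static struct fields get_both(void)
{
    uint8_t v = read_reg();
    struct fields f = { (uint8_t)(v & 0x0f), (uint8_t)((v >> 5) & 0x07) };
    return f;
}

/* Self-check: two reads for the independent accessors, one when grouped. */
static void check_read_counts(void)
{
    io_reads = 0;
    uint8_t lo = get_low_field();
    uint8_t hi = get_high_field();
    assert(io_reads == 2);

    io_reads = 0;
    struct fields f = get_both();
    assert(io_reads == 1 && f.low == lo && f.high == hi);
}
```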


IDE driver. Table 2 compares the performance
of a Devil-based IDE driver with that
of the original C driver. IDE throughput measurements
were obtained using the standard
Linux hdparm utility. We wrote two Devil specifications
for this driver: a specification of the
IDE controller and a specification of the Intel
PIIX4 PCI busmaster IDE.


We have run the IDE driver in both Ultra
DMA-2 and several PIO modes, varying the
size of I/O (16 or 32 bits) and the number
of sectors transferred per interrupt. In DMA
mode, Devil induces 6 additional I/O opera- 
tions to prepare the command. Because of the 
long duration of the DMA transfer, there is no 
impact on the available throughput. In the PIO 
modes, there are 3 additional I/O operations 
to prepare the command, plus 2 for each inter- 
rupt (#s denotes the total number of sectors 
accessed). When using a C loop over a single 
variable read, we measured a 10% throughput 
penalty. When using block transfer stubs that 
use a rep instruction, we did not observe an 
impact on the available throughput. 


Permedia2 X11 driver. Tables 3 and 4
show the performance of the Devil-based X11 driver
for the 3Dlabs Permedia2 graphics controller.
Throughput measurements were obtained using
the xbench utility. We have modified the
3Dlabs X11 server, which is based on an XFree86
3.3.6 implementation. Although the Permedia2
chip provides acceleration for both 2D
and 3D, the X11 server does not support 3D
operations. Additionally, to minimize device-dependent
code, many 2D primitives are implemented
in software in XFree86. In fact, hardware
acceleration is only used for implementing
the fill rectangle and screen area copy
primitives.


Unlike many I/O devices, the Permedia2 
controller maps registers into the memory ad- 
dress space. In fact, processor accesses are de- 
coded by the controller and stored in a FIFO. 
Before accessing the chip, the driver must wait 
for free entries in the FIFO. This wait loop induces
one I/O operation per iteration. In Tables
3 and 4, #w denotes the number of iterations
per wait loop. In the driver we modified,
2 or 3 wait loops are performed per primitive
call.


Table 3: Comparative Performance of Permedia2 XFree86 Driver: Rectangle Test


Table 4: Comparative Performance of Permedia2 XFree86 Driver: Screen Copy Test


The time for execution of a drawing com- 
mand by the Permedia2 controller is propor- 
tional to the number of drawn pixels and their 
depth. Therefore, the overhead induced by
Devil is more perceptible for the shortest commands.
The worst case is reached for 2x2 pixel
commands in 8 or 16 bit mode, where Devil
induces a performance penalty of up to 6%.
For primitive calls involving more than 100 pixels
(which are the most common in practice),
99% to 100% of the performance of the original
server is obtained (always 100% in 24 bit
mode).


5 Related Work


Our work on device drivers started with a
study of graphic display adaptors for an X11
server. We developed a language, called GAL,
aimed at specifying device drivers in this con- 
text [19]. Although successful as a proof of con- 
cept, GAL covered a very restricted domain. 


The goal of the UDI project⁵ is to make
device drivers source-portable across OS platforms.
To do so, they have normalized the API
between the OS and the lower part of device
drivers [14]. While UDI shows the timeliness of
our work, it focuses only on the high-level
part of drivers and their interaction with the
OS.


⁵ The UDI (Uniform Driver Interface) project is the
result of a collaboration of several computer companies
including Compaq, HP and IBM.


Windows-specific driver generators like Blue- 
Water System’s WinDK [4] and NuMega’s 
DriverWorks [6] provide a graphical interface 
for specifying the main features of a driver. 
They produce a driver skeleton that consists of 
invocations of coarse-grained library functions. 
To our knowledge, no existing driver generators 
cover the communication with the device. 


Languages for specifying digital circuits and 
systems have existed for many years. The 
VHDL standard [11], widely used in this do- 
main, is one of the most expressive. It ad- 
dresses several aspects of chip design such 
as documentation, simulation and synthesis. 
VHDL provides both high-level and low-level 
abstractions: arrays and loops are supported, 
as well as bit-vector literals and bit extrac- 
tion. However, all VHDL abstractions focus on 
the inner workings of circuits, not their high- 
level programming interface. As a consequence, 
chip interfaces are not explicitly denoted, and 
VHDL compilers perform limited consistency 
checks. Interestingly, VHDL allows attaching 
arbitrary strings to variables. Using them to 
add interface-specific information is possible, 
but would require a normalized syntax and 
compiler support, which in some way amounts 
to embedding Devil concepts in VHDL. 


The New Jersey Machine-Code Toolkit [15]
helps programmers write applications that process
machine code at an assembly-language
level of abstraction. Guided by an instruction
set specification, the toolkit generates the code
for reading or generating binary. Some simple
verifications are also done at the specification
level.


6 Conclusion and Future Work 


This paper has presented a new approach 
to developing hardware operating code that is 
based on an IDL named Devil. This IDL en- 
ables hardware communication to be described 
using high-level, domain-specific constructs instead
of being written with assembly-language-like
operations. Raising the implementation
level of this layer of a device driver dramatically
reduces the risk of errors. Devil has been shown
to be expressive enough to specify a wide variety
of devices such as DMA, interrupt,
Ethernet, IDE disk, sound, mouse and video
controllers.


Because Devil significantly raises the level of 
abstraction of communication with the hard- 
ware, Devil specifications are more readable, 
maintainable and re-usable than equivalent C 
code. 


We have developed a compiler that checks 
the consistency of a Devil specification and 
automatically generates low-level code that is 
mostly comparable to hand-crafted code. We 
have assessed our approach by conducting ex- 
periments aimed at comparing hardware oper- 
ating code in C or Devil for robustness and per- 
formance. We have demonstrated that our ap- 
proach enables hardware operating code to be 
more robust than C, with mostly comparable 
performance. 


Our future work aims to improve the per- 
formance of the output of our Devil com- 
piler. Specifically, we want to enhance per- 
formance by factorizing and scheduling de- 
vice communications and by better exploit- 
ing special-purpose assembly-level instructions. 
The key advantage of introducing optimiza- 
tions at the compiler level is that these ad- 
vanced techniques are transparently available 
to any Devil programmer. As a result, our work 
reduces the need to have a highly experienced 
programmer to write hardware operating code 
since part of this expertise is captured by the 
compiler. 


We are currently building a public domain
library of Devil specifications for common devices
such as those found in PCs. Our purpose
is to set up a WWW repository that would help
disseminate expertise about hardware and
facilitate the development of device drivers.


Abstract


Out-of-core applications consume physical resources at a rapid 
rate, causing interactive applications sharing the same machine
to exhibit poor response times. This behavior is the result of de- 
fault resource management strategies in the OS that are inappro- 
priate for memory-intensive applications. Using an approach that 
integrates compiler analysis with simple OS support and a run- 
time layer that adapts to dynamic conditions, we have shown that 
the impact of out-of-core applications on interactive ones can be 
greatly mitigated. A combination of prefetching pages that will 
soon be needed, and releasing pages no longer in use results in 
good throughput for the out-of-core task and good response time 
for the interactive one. Each class of application performs well 
according to the metric most important to it. In addition, the OS 
does not need to attempt to identify these application classes, or 
modify its default resource management policies in any way. We 
also observe that when an out-of-core application releases pages, 
it both improves the response time of interactive tasks, and also 
improves its own performance through better replacement deci- 
sions and reduced memory management overhead. 


1 Introduction 


Many of the computational problems of interest to sci- 
entists and engineers involve data sets that are much larger 
than physical memory [6, 7, 17]. Despite the continu- 
ing trend toward larger memories, it is unlikely that these 
data sets will ever fit entirely within main memory. In- 
creases in processor power and memory capacity make it 
feasible to solve larger problems, or to solve the same 
problem at a finer granularity, but the size of the data set 
grows with the problem being solved. For instance, input 
data sets for scientific visualization can currently exceed 
100 Gbytes [5]. For these “out-of-core” applications, I/O is 
required throughout the execution of the program to bring 
data into memory as it is needed and possibly to move it 
back out to disk. Performance concerns have traditionally 
forced programmers to explicitly manage the I/O in their 
out-of-core codes. Recently, however, we demonstrated 
that paged virtual memory can be enhanced with prefetch- 
ing to effectively hide the latency of page faults without 


placing any burden on the programmer [15]. In this ap- 
proach, the compiler provides information on future access 
patterns, the OS supports a simple prefetch/release inter- 
face, and a run-time layer improves performance by adapt- 
ing to dynamic behavior. 

While this earlier work demonstrated that out-of-core 
applications can achieve excellent performance on a ded- 
icated machine, it would be far more cost-effective if these 
tasks could coexist with other applications in a multipro- 
grammed environment. Unfortunately, out-of-core tasks 
have the potential to severely degrade the performance of 
other tasks which are attempting to use the machine at the 
same time. This problem arises because operating on mas- 
sive data sets consumes physical resources (memory and 
disk bandwidth) at a rapid rate, displacing the working sets 
of other applications and increasing their page fault ser- 
vice times. To make matters worse, successful prefetching 
causes physical resources to be consumed even faster, in- 
creasing the negative impact on other applications. 


1.1 Impact on Interactive Performance 


In many cases, the excessive resource consumption by 
out-of-core tasks is caused not by inherent resource re- 
quirements, but rather by sub-optimal resource manage- 
ment policies in the OS. While the default policies perform 
well in most cases, they are poorly suited to the demands 
of memory-intensive programs. For instance, most com- 
mercial operating systems use a global page replacement 
algorithm, which allows pages to be stolen from any ap- 
plication to satisfy page faults. Interactive tasks are par- 
ticularly vulnerable in such an environment since they are 
unable to defend their memory effectively. Consider an ed- 
itor program which may have no memory system activity 
for several seconds while it waits for user input. A pro- 
gram computing the inner product of two out-of-core vec- 
tors could easily sweep through all of physical memory in 
this time, stealing pages from the editor as they move to the 
head of the LRU queue. In this case, the out-of-core com- 
putation could have achieved the same performance using 
only two pages of physical memory, allowing the editor to
remain responsive regardless of the intervening delay.


Figure 1. Impact of sharing the machine with an out-of-core
matrix-vector multiplication (MATVEC) on the response time
of an interactive task across a range of sleep times between
touching 1 MB of data.


To illustrate the impact of out-of-core applications on interactive
performance, we ran the following experiment on
a 4-processor SGI Origin 200 configured to have approximately
75 MB of memory available to user programs.¹ A
simple program emulates the memory system behavior of
an interactive task by repeatedly touching a 1 MB data set,
then sleeping for a fixed amount of time. By varying the
amount of sleep time we can control the frequency with 
which each page of the “interactive” task is accessed. The 
“response time” is the time to touch the entire data set. This 
program is run concurrently with one that repeatedly per- 
forms a matrix-vector multiplication on an out-of-core data 
set (400 MB). The results are shown in Figure 1. With 
no sleep time, the “interactive” task defends its memory 
extremely well, achieving the same response time as on a 
dedicated machine. As the sleep time increases, however, 
the task incurs an increasing number of page faults and the 
response time rises. When the out-of-core program uses 
prefetching, the response time begins to increase at much 
shorter sleep times, grows much faster, and rises to a higher 
level. Prefetching combined with global replacement puts 
the interactive task at a serious disadvantage. 
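
The emulated interactive task described above can be sketched in a few lines of C. This is an illustrative reconstruction, not the program used in the experiment: the function names, the choice of touching one byte per page, and the use of the POSIX clock_gettime and nanosleep calls are all assumptions.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define DATA_SET_BYTES (1024 * 1024)   /* 1 MB, as in the experiment */

/* Touch one byte in every page of the data set; returns the number of
 * pages touched. Each touch faults the page back in if it was stolen. */
static size_t touch_data_set(volatile unsigned char *buf, size_t bytes,
                             size_t page_size)
{
    size_t pages = 0;
    assert(page_size > 0);
    for (size_t off = 0; off < bytes; off += page_size) {
        buf[off]++;
        pages++;
    }
    return pages;
}

/* One iteration of the emulated task: time the touch of the whole data
 * set (the "response time", in seconds), then sleep for sleep_sec. */
static double timed_touch_then_sleep(volatile unsigned char *buf,
                                     size_t page_size, unsigned sleep_sec)
{
    struct timespec t0, t1;
    struct timespec pause = { (time_t)sleep_sec, 0 };

    clock_gettime(CLOCK_MONOTONIC, &t0);
    touch_data_set(buf, DATA_SET_BYTES, page_size);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    nanosleep(&pause, NULL);

    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling timed_touch_then_sleep in a loop with increasing sleep_sec values, while a memory-intensive program runs alongside, would give response-time curves of the kind plotted in Figure 1.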

In recognition of the shortcomings of existing OS poli- 
cies, a significant amount of recent research has focused on 
customizable operating systems. While a customizable OS 
could provide the flexibility to tailor the resource manage- 
ment policies for out-of-core codes, our results in this pa- 
per demonstrate that we can achieve the desired outcome 
(i.e. customizable behavior) in this particular case through 
relatively modest extensions of today’s commercial operat- 
ing systems. To accomplish this goal, we adopt a strategy 
similar to our earlier work [15] in which the OS, compiler, 
and a run-time layer all cooperate. The role of the OS is to
perform global resource allocation across all applications
while the role of each out-of-core application (via the compiler
and run-time layer) is to effectively manage the resources
it has been granted.


¹ This amount of memory is artificially low for modern systems, but
makes it possible to run experiments on out-of-core programs in a reasonable
amount of time. Similar behavior can be seen with more memory
and larger out-of-core programs, although the time required to consume
all physical memory increases with the amount of memory available.


1.2 Objectives of This Study 


In our earlier study [15], our focus was using prefetching
to hide the I/O latency of out-of-core applications running
on a dedicated machine. In this study, we focus on using
release operations to manage physical memory intelligently 
within a multiprogramming workload that includes an out- 
of-core application. Although we introduced the concept of 
release operations in that earlier paper, we made little use of 
them because they offered no significant performance ben- 
efit to stand-alone out-of-core applications on the research 
prototype OS (Hurricane [21]) and machine (Hector [22]) 
that we used. Note that we observe a different result in this 
study using a modern commercial OS and machine. 

The primary contribution of this paper is that we pro- 
pose, implement, and evaluate a solution to the problem 
of preventing out-of-core applications from ruining the re- 
sponse time of interactive applications while still enjoy- 
ing the performance benefits of aggressive I/O prefetching. 
Our solution uses the compiler to automatically insert re- 
lease hints (in addition to prefetch hints) into the out-of- 
core application while a run-time layer and OS provide ap- 
propriate support. This approach requires minimal changes 
to existing operating systems and places no additional bur- 
den on the programmer. We implement our solution within 
a modern commercial system (an SGI Origin 200 running 
our modified version of IRIX 6.5) and evaluate its perfor- 
mance impact on both out-of-core applications and interac- 
tive tasks sharing the same machine. 

The remainder of this paper is organized as follows. Sec- 
tion 2 motivates allowing applications to manage their own 
resources, and describes the features we feel are needed to 
do so effectively. Section 3 describes the components of 
our system and their implementations. Section 4 presents 
our experimental results, and we discuss related work and 
draw conclusions in Sections 5 and 6. 


2 Memory Management Strategies 


The goal of a virtual memory management system in a 
multiprocessor environment is to share the physical mem- 
ory resources among all the competing applications. Most 
operating systems provide policies that perform well in the 
common case, but exhibit bad behavior when a memory- 
intensive program is sharing the machine with others. In 
this section we discuss why it may be beneficial to give de- 
manding applications control over their own memory man- 
agement, and examine some forms such control could take. 
Finally, we outline the features we believe are necessary 
for an effective system that allows applications to explic- 
itly manage their memory resources. 


2.1 Global vs. Local Replacement 


An out-of-core task can degrade the responsiveness of
an interactive task because global replacement policies select
victims from among all the pages in the system without
regard to ownership. In contrast, a local page replacement
strategy helps to isolate each process from the paging
activity of others. Each process is allocated a fixed
set of physical pages and a victim is selected from among 
them as needed. Thus, interactive tasks would not have to 
worry about losing pages to a demanding out-of-core pro- 
gram. Unfortunately, poor memory utilization may occur, 
as pages are not allocated to processes according to their 
need. Attempting to determine the right number of pages 
to allocate to each process and dynamically adjusting this 
number during execution can improve memory usage but 
greatly complicates the OS. In practice, most workstation 
operating systems use global page replacement. 

Although local replacement policies can insulate pro- 
cesses from each other, they may not provide the best re- 
placement policy for each application. Rather than altering 
the overall strategy employed by the OS, it is preferable to
modify individual applications so that their competition for 
physical resources better reflects their actual needs. This 
approach enables applications to improve their own per- 
formance through local replacement decisions that are su- 
perior to those used by the OS. The largest drawback of 
specializing applications to do memory management is the 
burden placed on the programmer; however, we propose 
a framework in which all the necessary modifications are 
performed automatically by a compiler. 


2.2 Application-Managed Replacement 


Giving specialized applications more control over their 
own memory management to improve their performance 
has been suggested before. For instance, the Mach OS sup- 
ports external pagers to allow applications to control the 
backing storage of their memory objects [18]. Extensions 
to the external pager interface have been used to implement 
user-level page replacement policies [14], and to support
discardable pages (i.e. dirty pages that do not need to be 
written to backing store) [20]. In contrast, our approach 
shows that specialized applications can and should exploit 
extra control for the benefit of other applications execut- 
ing concurrently. This is especially true for programs that 
use prefetching to improve their own performance since the 
gains they enjoy impose a heavy penalty on other processes 
sharing the system. In this case, the OS could require that 
prefetching applications also explicitly release pages. 

Given that application-controlled memory management 
is desirable, one possibility is for the OS to allow applica- 
tions to choose from a small set of “reasonable” replace- 
ment policies. This strategy does not require much effort 
on the part of the application programmer, but also does 
not provide a great deal of power or flexibility. Another 
possibility is for the OS to provide a more general interface 
that allows applications to explicitly specify which of their 
pages can be reclaimed. This approach is preferable since 
individual applications can implement a variety of replace- 
ment policies tailored to their specific needs. 

Application management of memory resources through
an interface that allows individual pages to be specified can
be either reactive or pro-active. In a reactive approach, the
OS notifies the application when one or more of its pages 
is about to be reclaimed. The application can then im- 
plement its own replacement policy by telling the system 
which pages to take. This is essentially the approach taken 
by the VINO page eviction extension [19], for example. A 
reactive system benefits applications that can make better 
replacement decisions than the default OS policy, and has 
the advantage of delaying the decision until memory actu- 
ally needs to be reclaimed. Unfortunately, it will not help 
isolate other applications from a memory-intensive one— 
the OS still decides which processes should give up pages. 

In a pro-active system, an application returns pages to 
the system before they are strictly required, either as soon 
as they are no longer needed or based on some other criteria 
such as the amount of free memory. A pro-active approach 
can obviate the need for the OS to steal pages by increasing 
the global pool of free memory, thus providing benefit to all 
applications sharing the system. Of course, the pro-active 
approach is not without potential cost to the application us- 
ing it. If the decision to release memory is made without 
full knowledge of future accesses, as is typically the case, 
then the application may give up pages that are still useful. 

Our goal is to develop a system that allows applications 
to pro-actively return memory to the system on a page-by- 
page basis, to the mutual benefit of themselves and other 
concurrently executing applications without placing any 
additional burden on the programmer. We now outline the 
elements that we believe are necessary to achieve this goal. 


2.3 Requirements for Effective Application-Directed Memory Management


If applications are to manage their own memory usage,
the first requirement is some form of support from the OS 
for this type of activity. Second, to automate memory man- 
agement without rewriting the application source code, we 
will need compiler analysis to detect access patterns and 
insert the necessary paging operations. Finally, since good 
replacement decisions will depend on dynamic conditions 
during program execution, we will need a run-time layer 
to intercept the information provided by the compiler and 
adapt the application’s behavior as required. 


2.3.1 Operating System Support 

The OS must supply both primitive operations and addi- 
tional information to applications. The operations should 
allow the application to specify the virtual memory ad- 
dresses that it will need in the future as well as those that 
it no longer needs. The additional information is needed 
to allow the application to make informed decisions about 
when memory management activity is required. It should 
include information about which virtual pages are currently 
in memory, how many pages are currently in use, and the 
upper limit on pages that the application should use. 


2.3.2 Compiler and Run-time Support 

To determine whether a given page should be released at
a particular point, the compiler attempts to answer the
following questions. First, will the page be referenced again
in the future? If not, then a release hint is inserted. Second,
does the number of other unique pages that will be accessed
before the page is reused exceed the expected amount of
available memory? If so, then the page is unlikely to remain
in memory, and a release hint is inserted. Otherwise,
release hints are not inserted.


Figure 2. Information flow between components of our system.

There is a certain duality between the analysis for in- 
serting prefetches and releases. In both cases, the com- 
piler attempts to model when pages are being reused, and 
whether enough intervening accesses exist between these 
reuses to cause displacement. For prefetching, the ques- 
tion is whether a given page has remained in memory since 
its last reuse (if so, we do not need to insert a prefetch 
hint for it); for releasing, the question is whether a given 
page will remain in memory until its next reuse (in which 
case we do not want to release it). One difference, how- 
ever, is that prefetching uses this analysis only to min- 
imize overheads—the latency-hiding benefit of prefetch- 
ing depends only on scheduling prefetches early enough— 
whereas the benefit of release hints depends directly on the 
quality of this reuse analysis. 
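
The release questions and their prefetching counterpart can be written as a pair of predicates. This is only a minimal sketch: the function names and the page-count parameters are hypothetical, and the actual compiler pass derives these quantities from reuse and locality analysis of loop nests rather than receiving them as arguments.

```c
#include <stdbool.h>

/* Hypothetical helper: should a release hint be emitted for a page?
 *   reused_later         - does reuse analysis find a later reference?
 *   pages_before_reuse   - unique other pages touched before that reuse
 *   expected_avail_pages - compile-time estimate of available memory   */
bool should_insert_release(bool reused_later,
                           long pages_before_reuse,
                           long expected_avail_pages)
{
    if (!reused_later)
        return true;   /* the page is never referenced again */
    /* Reused, but too much intervening data for it to stay resident. */
    return pages_before_reuse > expected_avail_pages;
}

/* Hypothetical dual for prefetching: a prefetch hint is skipped only
 * when the page was used earlier and is expected to still be resident. */
bool should_insert_prefetch(bool used_earlier,
                            long pages_since_last_use,
                            long expected_avail_pages)
{
    if (!used_earlier)
        return true;   /* first touch: the page cannot be resident yet */
    return pages_since_last_use > expected_avail_pages;
}
```

Both predicates depend on the same estimate of available memory, which is why an inaccurate estimate affects release hints more directly than prefetch hints.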

Ideally, the compiler would be able to analyze the data 
accesses perfectly and insert these paging directives pre- 
cisely where they are needed. However, this ideal is not 
realistic for the following two reasons. First, one cannot al- 
ways predict memory access patterns with only static infor- 
mation. They may depend on run-time parameters (such as 
the problem size for the current run) or be data-dependent 
(such as the indirect references that often occur in sparse- 
matrix programs, e.g., a[b[i]]). While it is possible 
to issue prefetches for indirect references [8, 15], it is not 
possible to reason statically about any reuse that they may 
have, and hence it is not clear that the compiler can gen- 
erate useful release hints for them. The second major lim- 
itation of the compiler is that it decides when reuse can 
be exploited based on an assumption of how much mem- 
ory will be available to the application at run-time. In a 
multiprogrammed environment, such assumptions may be 
wildly inaccurate, especially since the amount of available 
memory may fluctuate dynamically during execution. 

For these reasons, it may be undesirable to actually re- 
lease a page at the point where the compiler has inserted the 
corresponding release hint. Instead, a run-time layer should
collect information about pages that could be released, ac- 
cording to the compiler-generated addresses, and actually 
perform the releases only when necessary. In addition to 
the addresses of releasable pages, the compiler should in- 
clude some indication of whether it believes the released 
pages will be used again or not. The role of the run-time
layer is to use the information provided by the OS and the
compiler to answer the following questions: When should 
memory be returned to the OS? How many pages should be 
released? Which of the “releasable” pages should actually 
be given up? Figure 2 depicts the flow of information from 
the compiler and the OS to the run-time layer. 

The decision of when to release memory depends pri- 
marily on how close the application is to the upper limit 
on memory usage suggested by the operating system. The 
decision of how much memory to release is more compli- 
cated. The run-time layer needs to balance the desire to 
remain below the OS limit, the desire to retain as much 
memory as possible, and the desire to perform release op- 
erations as infrequently as possible to minimize overhead. 
For example, suppose the run-time layer detects that the ap- 
plication is close to its upper memory limit, and has knowl- 
edge of 1000 pages that could be released. By releasing all 
of these pages, the run-time layer increases the amount of 
time before it will have to act again, but it may have given 
up pages that would be used again in the future by acting 
too aggressively. The run-time layer should also consider 
the application’s expected future need for memory when 
deciding how much to release. If the application is close 
to the upper memory limit, but only needs a small num- 
ber of additional pages, the run-time layer may not need to 
release memory at all. Finally, once the run-time layer has 
determined that a release is necessary, and has decided how 
many pages to release, it must choose which pages should 
actually be returned to the OS. This decision depends on 
the expected future use of these pages; the run-time layer’s 
choice should be guided by information from the compiler. 

There are two situations that may arise from the compiler 
analysis. First, the compiler may have inserted release hints 
because it has determined that the page will not be reused 
again. The run-time layer should release these pages before 
any pages that are known to have reuse. Second, the com- 
piler may have detected that data reuse existed, but inserted 
release hints anyway because the volume of data accessed 
between reuses was expected to flush the page from mem- 
ory. For these pages, the run-time layer should perform 
releases according to the intrinsic data reuse (which can be 
revealed by the compiler), attempting to keep as much data 
in memory as possible for the subsequent accesses. For 
instance, suppose the application is repeatedly accessing 
an array that is much larger than physical memory. The 
run-time layer can implement most recently used (MRU) 
replacement once the memory usage approaches the upper 
limit set by the OS, thus keeping at least the first portion of
the array in memory for future use. 
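
The benefit of MRU replacement for a repeatedly swept array can be checked with a small simulation. This is an illustrative sketch only, not the run-time layer itself: the function name, the fixed 64-frame cap, and the logical-clock bookkeeping are assumptions made for the example.

```c
enum policy { EVICT_LRU, EVICT_MRU };

/* Count page faults for `passes` sequential sweeps over an array of
 * `n_pages` pages using `frames` physical frames (1..64).
 * Returns -1 if `frames` is out of range. */
long count_faults(int n_pages, int frames, int passes, enum policy p)
{
    int  resident[64];   /* page number held by each frame        */
    long last_use[64];   /* logical time of the most recent access */
    int  used = 0;
    long clock = 0, faults = 0;

    if (frames < 1 || frames > 64)
        return -1;

    for (int pass = 0; pass < passes; pass++) {
        for (int page = 0; page < n_pages; page++) {
            int hit = -1;
            clock++;
            for (int f = 0; f < used; f++)
                if (resident[f] == page) { hit = f; break; }
            if (hit >= 0) { last_use[hit] = clock; continue; }

            faults++;
            int victim;
            if (used < frames) {
                victim = used++;          /* a free frame is available */
            } else {
                victim = 0;               /* pick oldest (LRU) or newest (MRU) */
                for (int f = 1; f < used; f++) {
                    int older = last_use[f] < last_use[victim];
                    if ((p == EVICT_LRU) ? older : !older)
                        victim = f;
                }
            }
            resident[victim] = page;
            last_use[victim] = clock;
        }
    }
    return faults;
}
```

With 5 pages, 3 frames, and 2 sweeps, LRU faults on every access (10 faults) because each page is evicted just before it is needed again, whereas MRU keeps the first pages of the array resident and incurs 7 faults.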


2.4 An Example 


To help illustrate these concepts, we now present a sim- 
ple example. Figure 3(a) shows the source code for a calcu- 
lation that averages an element of a matrix with its neigh- 
bors, while Figure 3(b) depicts the data elements that are 
touched during a single iteration of the innermost loop. 
The references have temporal reuse along the i dimension
(since the items accessed at a[i+1][*] are touched again
in the next iterations of the i-loop). There is spatial reuse
along the j dimension, and there may also be spatial reuse
along the i dimension, depending on the length of the rows.


(a) Source code for averaging nearest-neighbors

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    a[i][j] = (a[i+1][j-1] + a[i+1][j]
             + a[i+1][j+1] + a[i][j-1] + a[i][j]
             + a[i][j+1] + a[i-1][j-1] + a[i-1][j]
             + a[i-1][j+1])/9.0;


(b) View of data references to the matrix a
[Diagram: the data access square around a[i][j], with labels for the leading reference a[i+1][j+1], the trailing reference a[i-1][j-1], and the j axis.]


Figure 3. Example source code showing multiple references
with different types of reuse, and graphical view of the data
accesses during a single iteration of the innermost loop.

We can identify two major working sets in this access 
pattern. At the smallest level, we need to hold the leading 
edge of the data access square (those references indexed by 
j+1) in memory, requiring at most one page for each of the 
three references on this edge. Except at page boundaries, 
the references indexed by j-1 will fall on the same page 
as this leading edge due to spatial reuse. We therefore need 
at most six pages to fully exploit the spatial reuse along 
the j dimension. The second level working set exploits 
the temporal reuse along the i dimension, requiring us to 
hold three rows of the matrix in memory, so that the row 
first indexed by i+1 in one iteration will still be available 
for the i and i-1 references in the subsequent iterations. 
Of course, there is also a third level, which corresponds to 
keeping the entire matrix in memory. 
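
The size of the second level working set follows from simple arithmetic. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the element size is a parameter because the source does not state the matrix type, and page alignment of the first row is ignored.

```c
#include <stddef.h>

/* Pages needed to hold `rows` full rows of an n-column matrix,
 * ignoring the alignment of the first row within a page. */
size_t rows_to_pages(size_t rows, size_t n,
                     size_t elem_bytes, size_t page_bytes)
{
    size_t bytes = rows * n * elem_bytes;
    return (bytes + page_bytes - 1) / page_bytes;   /* round up */
}

/* Returns 2 if the three-row (second level) working set fits in
 * avail_pages, otherwise 1 (first level, at most six pages). */
int largest_fitting_level(size_t n, size_t elem_bytes,
                          size_t page_bytes, size_t avail_pages)
{
    return rows_to_pages(3, n, elem_bytes, page_bytes) <= avail_pages ? 2 : 1;
}
```

For example, with N = 100000, 8-byte elements, and 16KB pages, three rows occupy 2,400,000 bytes, which rounds up to 147 pages.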

The compiler can determine precisely which references 
to prefetch and release if it has the dimensions of the ma- 
trix and a good estimate of the physical memory available. 
To successfully exploit the reuse across iterations of the i
loop, we need to retain three rows of the matrix in memory.
If this is possible, then a prefetch will be inserted only
for the leading reference, a[i+1][j+1], and a release
will be inserted for the trailing reference, a[i-1][j-1].
This corresponds to keeping the second level working set
in memory. If the amount of memory needed to hold three
rows exceeds the amount available, the compiler will
instead decide to prefetch all three references on the leading
edge of the data access square (i.e. the a[*][j+1] references)
and release the references on the trailing edge,
corresponding to the first level working set. If the dimensions of
the matrix are unknown at compile-time, the compiler must
choose between these two options. Since over-estimating 
the ability of memory to retain data leads to missed op- 
portunities (both for prefetching and releasing), it is prefer- 
able to assume that only the smallest working set will fit in 
memory. The run-time layer is responsible for reducing the 
overhead of unnecessary operations that result. 

Having outlined the features that we believe are neces- 
sary to achieve a good pro-active user-level memory man- 
agement system, we turn now to a discussion of the specific 
components in our prototype system. 


3 Overview of Prototype System 


Our prototype system consists of three major compo- 
nents: extensions to the OS, a compiler analysis pass, and 
a run-time layer. We now describe these components. 


3.1 Implementation of OS Support 

We have implemented support for user-level paging di- 
rectives (i.e. prefetch and release) within the SGI IRIX 6.5 
operating system. IRIX 6.5 supports policy modules 
(PMs) that allow users to select various memory man- 
agement policies for page size, allocation, migration, and 
replication. A PM may be connected to any range of 
an application’s virtual address space, down to the level 
of a single page. We have defined a new PM, called
“PagingDirected”, that allows a user-level process to invoke
prefetch and release operations on pages of its address
space. In addition, the PagingDirected PM shares informa- 
tion about memory usage with the application through a 
single 16KB page. 


3.1.1 Managing the Shared Page 

The shared page is allocated by the OS and mapped read- 
only into the application’s address space when the Pag- 
ingDirected PM is created. The page is used primarily as a 
bitmap, indexed by virtual page number, in which bits are 
set to indicate that the corresponding page is in memory, 
and cleared otherwise. The first two words in the page are 
reserved, however, to indicate the current number of pages 
in use by the process, and the upper limit on pages that the 
process should be using, respectively. 
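
One way an application might read this shared page is sketched here. The layout details are assumptions, not taken from the IRIX implementation: the sketch assumes 64-bit words, least-significant-bit-first ordering within each bitmap word, and a bitmap index that has already been derived from the virtual page number. All type and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define SHARED_PAGE_BYTES (16 * 1024)  /* single 16KB shared page        */
#define RESERVED_WORDS    2            /* in-use count, then upper limit */
#define WORD_BITS         64           /* assumed word size              */
#define TOTAL_WORDS       (SHARED_PAGE_BYTES / sizeof(uint64_t))
#define BITMAP_BITS       ((TOTAL_WORDS - RESERVED_WORDS) * WORD_BITS)

typedef struct {
    uint64_t words[TOTAL_WORDS];
} shared_page_t;

/* First reserved word: pages currently in use by the process. */
uint64_t pages_in_use(const shared_page_t *sp) { return sp->words[0]; }

/* Second reserved word: upper limit on pages the process should use. */
uint64_t upper_limit(const shared_page_t *sp) { return sp->words[1]; }

/* Returns 1 if the page at bitmap index idx is marked in memory,
 * 0 if it is not, and -1 if idx lies outside the bitmap. */
int page_is_resident(const shared_page_t *sp, size_t idx)
{
    if (idx >= BITMAP_BITS)
        return -1;
    uint64_t word = sp->words[RESERVED_WORDS + idx / WORD_BITS];
    return (int)((word >> (idx % WORD_BITS)) & 1u);
}
```

Because the OS updates the page asynchronously, a real reader would treat the mapping as volatile and accept that the values are hints that may already be stale.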

All updates to the shared page are handled by the OS. 
When the PagingDirected PM is created, all bits in the 
shared page are initially set. When the application attaches 
the PM to a region of its virtual address space, the bits cor- 
responding to those addresses are all cleared. Thereafter, 
bits are set whenever a physical page is allocated for a vir- 
tual page associated with this PM, either due to prefetch 
requests or ordinary page faults. Bits are cleared when 
pages are reclaimed, either by an explicit release request 
or due to default page replacement activity. The estimates 
of current and maximum usage are updated only when the 
process experiences some type of memory system activ- 
ity, rather than every time the information changes. One 
consequence of this approach is that an application’s upper
limit may drop dramatically if another process begins using
memory (reducing the total free memory in the system),
but the first process will not be informed of this change until
it issues a prefetch/release request, page faults, or has
memory stolen from it. The alternative approach of im- 
mediate updates would require the OS to either maintain 
a list of processes that should be informed, or to scan the 
list of all processes each time the amount of free memory 
in the system changes. This additional expense does not 
appear to be justified. Another alternative that we have not 
explored would be to notify interested applications if con- 
ditions change by more than a set threshold, rather than 
waiting for memory activity to occur. 


3.1.2 Handling Prefetch and Release Requests 

When the PagingDirected PM receives a request to 
prefetch a page, it performs actions similar to those that 
occur for a page fault, with two notable exceptions. First, 
if there is no free memory, the request is discarded imme- 
diately. This feature prevents memory from being stolen 
to satisfy prefetches when the demand for memory is high. 
Second, when the request completes, the prefetched page 
is not fully validated and no entry is made in the TLB. This 
feature prevents mappings for prefetched pages from dis- 
placing TLB entries which are still in use. 

Requests to release pages are handled by passing the 
addresses to a new system releasing daemon—called the 
releaser—which functions similarly to the paging daemon, 
but is specialized to reclaim only the pages indicated by 
the application. When a release request is made, the Pag- 
ingDirected PM clears the bits for the pages and enters the 
request in the releaser’s work queue. The releaser handles 
requests as they are received, first checking the bit vector 
to make sure that the pages have not been referenced again 
(either by a prefetch or a real reference) since the time of 
the request. The releaser then performs all actions needed 
to free the pages, including writing back dirty pages. Re- 
leased pages are placed at the end of the free list, giving 
pages that were released too early a chance to be rescued. 


3.1.3 Setting the Memory Limit 

The goal in setting the upper limit on memory usage is
to prevent the default page replacement policies from being
activated, if at all possible. IRIX provides a number of
tunable system parameters that control when pages will be
stolen; these parameters can also be used by the PagingDirected
PM in an effort to prevent such activity. First, the
maximum number of pages that any process can have resident
in memory (max_rss) can be set. If a process exceeds
this limit, the system paging daemon will attempt to trim
physical pages from it. Second, the minimum number of
pages that should be kept free (min_freemem) can be set. If
total free memory falls below this limit, the paging daemon
will steal pages from all processes in the system according
to an approximation of an LRU policy.

If physical memory is ample, it is sufficient to tell the 
process to remain below max_rss. When memory is lim- 
ited, the process should be encouraged to use no more 
than its current memory usage (current_size), plus the 
amount of free memory in the system (tot_freemem), less
min_freemem. The recommended upper limit on memory
usage in our system is thus given as follows:


\[ \text{upper\_limit} = \min(\text{max\_rss},\ \text{current\_size} + \text{tot\_freemem} - \text{min\_freemem}) \qquad (1) \]


Note that in setting this upper limit we are not guarantee- 
ing that the application will be able to allocate this many 
pages for itself. Instead, the upper limit is an indication of 
the number of pages for which the application is allowed to 
compete. Pages that have already been allocated to another 
process are not part of the global free memory pool and 
thus may not be acquired by the prefetching application. 
One result of this decision is that the upper memory limit is 
a moving target which is dynamically adjusted as the total 
demand for physical memory by all applications changes. 
Thus, the OS does not try to determine the “right” amount
of memory to allocate to each process; it simply tells
interested processes how much memory is still available.
Finding the right amount of memory for each process is beyond
the scope of this paper.
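
Equation (1) translates directly into code. The function below is a minimal sketch with a hypothetical name; all quantities are page counts, and the signed type is an assumption made so that a shortfall of free memory is visible in the result.

```c
/* Recommended upper limit from equation (1), in pages. */
long recommended_upper_limit(long max_rss, long current_size,
                             long tot_freemem, long min_freemem)
{
    /* Pages the process may compete for without pushing total free
     * memory below the level at which the paging daemon starts. */
    long competitive = current_size + tot_freemem - min_freemem;
    return competitive < max_rss ? competitive : max_rss;
}
```

When tot_freemem drops below min_freemem, the result is smaller than current_size, which signals that the process should give pages back rather than acquire more.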


3.2 Implementation of Compiler Analysis 


We implemented our compiler algorithm as a pass in 
the SUIF (Stanford University Intermediate Format) com- 
piler [9]. This algorithm is an extension of the algorithm 
we developed earlier for inserting prefetching hints into 
array-based codes [15]; pointer-based data structures are 
not currently handled, although techniques used for cache 
prefetching may be applicable [13]. We now briefly de- 
scribe our algorithm. The following parameters are given 
to the compiler to describe the target system: the size of 
main memory, the page size, and the page fault latency. 
The compiler first uses reuse analysis to detect the intrinsic 
data reuses in the access patterns, then uses the page size 
and memory size parameters to apply locality analysis to 
predict when misses (i.e. page faults) are likely to occur. 
References that are likely to suffer page faults are isolated 
through loop splitting techniques, and prefetches for these 
references are scheduled based on the latency parameter 
using software pipelining. Figure 4 shows the process of 
creating the specialized executable from the original source 
code. The compiler analyzes each set of nested loops
independently; thus, reuses that occur between independent sets
of loops are not considered. While the earlier algorithm
did insert release hints in some cases, we have extended 
that analysis in two major ways: (i) we insert releases far 
more aggressively, and (ii) we encode reuse information 
into the release hints to allow the runtime layer to choose 
which pages to release first. 

Given the existing locality analysis, it is relatively
straightforward to generate release operations. During
locality analysis, the compiler identifies groups of references
that effectively share the same data and can be treated as
a single reference; this is called “group locality”. For
each of these groups (a group may contain only a single
reference), the compiler identifies the leading reference
(i.e. the first reference to access the data) as the reference
to prefetch. We simply extend this analysis to also identify
the trailing reference (the last one to touch the data)
as the address to release. For indirect references (e.g.,
a[b[i]]), we do not insert a release request since it is
too hard to predict whether the data will be accessed again.


Figure 4. Steps in the automatic transformation of original application into prefetching/releasing executable.

In addition to identifying the addresses of data that can 
be released, the compiler also indicates whether the data 
has temporal reuse, and how soon the reuse is expected, 
based on the reuse analysis. (Recall that releases may be 
generated because the reuse is not expected to result in 
locality). The reuse information is encoded as a priority 
value which is passed as a parameter in the release requests; 
larger numbers represent references with earlier reuse—i.e. 
those which we would most prefer to retain in memory. The 
release priority is calculated as follows. Let depth(i) denote 
the depth of loop i, with the outermost loop nest having a
depth of 0. Let temporal(x) be the set of nested loops in 
which reference x has temporal reuse. The release priority 
is computed by the following equation: 


\[ \text{priority}(x) = \sum_{i \in \text{temporal}(x)} 2^{\text{depth}(i)} \qquad (2) \]


The run-time layer can use this information to prioritize 
which pages are actually returned to the system when the 
memory usage approaches the upper limit, attempting to
retain those pages that will be reused earlier to reduce the 
total amount of paging. 
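
The release priority can be computed by summing one power of two per loop that carries temporal reuse. This is a minimal sketch under stated assumptions: the function name is hypothetical, and temporal(x) is represented as a plain array of loop depths.

```c
/* Release priority of a reference, given the depths of the loops in
 * which it has temporal reuse (outermost loop has depth 0).
 * Depths are assumed to lie in the range 0..31. */
unsigned release_priority(const int *temporal_depths, int n)
{
    unsigned priority = 0;
    for (int k = 0; k < n; k++)
        priority += 1u << temporal_depths[k];   /* adds 2^depth */
    return priority;
}
```

A reference with temporal reuse only in the outermost loop gets priority 1, one with reuse in the loops at depths 0 and 1 gets priority 3, and a reference with no temporal reuse gets priority 0. Deeper loops contribute larger terms, so reuse that recurs sooner yields a higher priority.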

Figure 5 shows an example of the output of our compiler 
for a set of loops that repeatedly perform a matrix-vector 
multiplication. The compiler analysis has determined that 
references to the b array have temporal reuse with respect 
to both the i-loop and the iter-loop, but that this reuse is 
not expected to result in locality since the volume of data 
accessed between reuses is more than the memory size pa- 
rameter. In contrast, references to the a array have tem- 
poral locality with respect to the iter-loop only. Both 
array references have spatial reuse (and locality) causing 
the compiler to schedule prefetches for the first reference 
to each page, and releases after the last reference to each 
page. Using equation (2), a release priority of 1 is assigned
to the releases for the a array, and a priority of 3 is assigned
to the releases for the b array, indicating that b’s pages will
be reused before a’s pages. Neither prefetches nor releases
are inserted for the c array since this item is smaller than a
page and is expected to remain in memory.


(a) Original Code

int a[100][1000000];
int b[1000000];
int c[100];

for (iter = 0; iter < 10; iter++)
  for (i = 0; i < 100; i++)
    for (j = 0; j < 1000000; j++)
      c[i] = c[i] + a[i][j]*b[j];


(b) Code with Prefetch and Release

for (iter = 0; iter < 10; iter++) {
  for (i = 0; i < 100; i++) {
    prefetch_block(&a[i][0], 56, 1, 0);
    prefetch_block(&b[0], 56, 3, 3);
    for (j1 = 0; j1 < 770048; j1 += 16384) {
      prefetch_release_block(&a[i][245759 + j1],
                             &a[i][j1-16384], 4, 1, 2);
      prefetch_release_block(&b[245759 + j1],
                             &b[j1-16384], 4, 3, 5);
      for (j = j1; j < j1 + 16384; j++)
        c[i] = c[i] + a[i][j]*b[j];
    }
    for (j = 770048; j < 1000000; j++)
      c[i] = c[i] + a[i][j]*b[j];
    release_block(&a[i][770048], 56, 1, 1);
    release_block(&b[770048], 56, 3, 4);
  }
}


Figure 5. Example of the output of the prefetching compiler.
Arguments are: (prefetch address, release address, number
of 16KB pages, release priority, request identifier)


3.3 Implementation of the Run-time Layer


Figure 6 illustrates how prefetches and releases are pro- 
cessed by the run-time layer. To achieve the full bene- 
fit of prefetching, we need to be able to both fetch data 
asynchronously (so the application can continue after is- 
suing the prefetch) and take advantage of any available 
parallelism in the disk subsystem. The run-time layer ac- 
complishes these requirements by creating a number of 
pthreads ({\1] that make the actual calls to the PagingDi- 
rected PM and wait for the prefetches to complete. When 
a prefetch request inserted by the compiler is intercepted 
by the run-time layer, the bitvector is checked to see if 
a prefetch is really needed. If so, the request is placed 
on a work queue and one of the prefetching threads is 
signaled to handle the request. The prefetching threads 
simply remove requests from the queue and issue them to
the PagingDirected PM. We chose to use a pthreads-based
approach since the IRIX kernel does not provide asynchronous
I/O to user-level programs. Rather than attempt
to add this functionality to IRIX, we chose an approach
very similar to the implementation of the asynchronous I/O
library in IRIX.


(b) Buffering of release requests using tags and priorities assigned
by the compiler.


Figure 6. Handling prefetches and releases at run-time.

The same set of pthreads is also used to actually issue
the release requests to the OS. We have built run-time
layers which implement two different policies for handling
the release requests inserted by the compiler: one aggressively
issues release requests to the OS at the time when
they are encountered, while the other buffers releases based
on the compiler-inserted priorities and only issues requests
when necessary, based on the information provided by the
OS. By comparing these two approaches, we can evaluate
the usefulness of buffering release requests in the run-time
layer rather than simply relying on the compiler analysis.

In both cases, the run-time layer attempts to reduce over- 
head by filtering out the obviously bad releases inserted by 
the compiler. There are two ways in which these bad re- 
leases are detected. First, the requests inserted by the com- 
piler are checked against the bitvector to make sure that the 
pages are in memory. Second, the run-time layer tracks 
the last address released for each unique release directive 
placed in the code, using the request identifier (or tag) gen- 
erated by the compiler. The first release request for any tag 
is recorded until the next request for that tag is issued. If a 
release request identifies the same page as the previous re- 
quest, it is dropped since the page is obviously still in use. 
If, instead, the current release request identifies a different
page, then the previously recorded release is actually han- 
dled and the current one is recorded. The releases issued 
by the run-time layer are thus always one or more iterations 
behind those identified by the compiler. Handling a previ- 
ously recorded request involves either placing it in a release 
queue (if buffering is being used), or issuing it to the OS. 
Programs with loop nests that have unknown bounds often 
cause the compiler to generate overly-aggressive code, and 
these simple checks help to reduce the overhead of releas- 
ing pages that are still in active use. 
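
The two filtering checks described above can be sketched as follows. This is a minimal illustration in Python, not the run-time layer's actual code. The class name ReleaseFilter, the use of a set of resident page numbers in place of the bitvector, and the handle callback (which stands in for either queueing the request or issuing it to the OS) are all assumptions introduced here. The sketch also assumes that a request for a non-resident page is dropped without disturbing the request already recorded for its tag, which the text does not specify.

```python
class ReleaseFilter:
    """Drops obviously bad release requests before they are handled."""

    def __init__(self, resident, handle):
        self.resident = resident  # assumed stand-in for the bitvector: set of in-memory pages
        self.handle = handle      # hypothetical callback: enqueue or issue to the OS
        self.pending = {}         # tag -> page recorded by the most recent accepted request

    def request(self, tag, page):
        # Check 1: ignore releases for pages that are not in memory.
        if page not in self.resident:
            return
        previous = self.pending.get(tag)
        # Check 2: same page as the previous request for this tag,
        # so the page is evidently still in use; drop the request.
        if previous == page:
            return
        # A different page: the previously recorded release is now handled,
        # which keeps handled releases one or more iterations behind.
        if previous is not None:
            self.handle(tag, previous)
        self.pending[tag] = page
```

Because a release is only handled once a later request for the same tag names a different page, the last recorded page for each tag stays pending in this sketch until another request arrives.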

Figure 6(b) shows how release requests are buffered. Re- 
quests with no reuse (i.e. a priority of 0) are issued to the 
OS after passing the simple checks. Other requests are 
stored in release queues indexed by their tags, allowing 
multiple buffered releases for a particular reference to be 
coalesced into a single entry in the queue. When the first re- 
lease for a tag is seen, the priority value is used to index into 
the priority list where a pointer is set to the release queue 
for that tag. The priority list can hold pointers to multiple 
queues having the same priority. When a release request 
is placed into one of the queues, the current memory us- 
age and memory limit are checked. If the current usage is 
close to the limit, the priority list is used to issue releases 
from the lowest-priority queues. Requests are issued from 
all queues at the same priority level in a round-robin fash- 
ion. Currently, the run-time layer attempts to release a total 
of 100 pages whenever releasing is deemed necessary. (We 
have not experimented with varying this parameter.) 
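
The buffering policy described above can be modeled roughly as below. This is a sketch under stated assumptions rather than the implementation from Figure 6(b): the names ReleaseBuffer, issue, and slack are hypothetical, "close to the limit" is interpreted as usage >= limit - slack, pages within a queue are issued oldest first, and coalescing of multiple releases into a single queue entry is not modeled (each page is stored separately).

```python
from collections import deque


class ReleaseBuffer:
    """Toy model of release buffering with per-tag queues and a priority list."""

    def __init__(self, issue, batch=100, slack=0):
        self.issue = issue        # hypothetical callback: release one page to the OS
        self.batch = batch        # pages to release per drain (100 in the text)
        self.slack = slack        # assumed meaning of "close to the limit"
        self.queues = {}          # tag -> deque of buffered pages
        self.priority_list = {}   # priority -> tags whose queues share that priority

    def add(self, tag, page, priority, usage, limit):
        if priority == 0:
            self.issue(page)      # no reuse expected: issue immediately
            return
        if tag not in self.queues:
            # First release for this tag: link its queue into the priority list.
            self.queues[tag] = deque()
            self.priority_list.setdefault(priority, []).append(tag)
        self.queues[tag].append(page)
        if usage >= limit - self.slack:
            self.drain()

    def drain(self):
        remaining = self.batch
        for priority in sorted(self.priority_list):  # lowest priority first
            tags = self.priority_list[priority]
            # Round-robin across all queues at this priority level.
            while remaining > 0 and any(self.queues[t] for t in tags):
                for t in tags:
                    if remaining == 0:
                        break
                    if self.queues[t]:
                        self.issue(self.queues[t].popleft())
                        remaining -= 1
            if remaining == 0:
                break
```

With a batch of 3 and two priority-1 tags holding pages [1, 2] and [3, 5], a drain in this model issues pages 1, 3, 2 in that order, alternating between the two queues before touching any higher-priority queue.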

As we will show in Section 4, even the simple strategy 
of always issuing the releases improves the performance 
of the prefetching out-of-core application over prefetching 
alone, while simultaneously keeping memory free for other 
applications in most cases. When there is temporal reuse 
in an application, however, the advantages of prioritizing 
releases become clear. 


4 Experimental Results 


To evaluate the concepts presented in this paper, we ran
several out-of-core applications with the simulated interac-
tive task described in Section 1.1. We will first describe
the platform used to obtain these results, then look at the
impact of prefetching, alone and with both aggressive re-
leasing and release buffering, on the execution time of the
out-of-core program. To explain the basic performance re-
sults, we will then take a closer look at the effectiveness
of the release operation by examining the activity in the
virtual memory subsystem. Finally, we evaluate the use-
fulness of explicitly releasing memory for improving the
response time of the interactive task.


Table 1. Experimental platform characteristics.

Processor
  Processor type:                          MIPS R10000
  Number of processors:                    4
  Clock rate:                              180 MHz

Physical Memory
  Total size:                              128 MBytes
  Available to application:                75 MBytes
  Page size:                               16 KBytes

Disks
  Manufacturer:                            Seagate
  Model:                                   Cheetah 4LP
  Number of disks used for swap:           10
  Maximum external (I/O) transfer rate:    40 MBytes/sec/disk
  Average rotational latency:              2.99 msec
  Track-to-track seek, read:               18 msec (typical)
  Track-to-track seek, write:              19 msec (typical)
  Number of SCSI controllers:              5
  Disks per controller:                    2


4.1 Hardware Platform 


Our experimental results were obtained on a 4-processor
SGI Origin 200, running our modified version of the 
IRIX 6.5 operating system. The system was configured so 
that approximately 75MB of physical memory was avail- 
able to user programs, and the system swap space was 
striped across ten Seagate Cheetah 4LP disks using raw 
swap partitions. Five SCSI adapters each control two of 
these ten disks; the SCSI adapters are in turn connected to 
the PCI buses on the Origin. The basic hardware character- 
istics of our system are summarized in Table 1. 


4.2 Benchmarks 


We performed our experiments using out-of-core ver- 
sions of five applications taken from the NAS Parallel 
benchmark suite [1] as well as a matrix-vector multiplica- 
tion kernel (MATVEC). The code for MATVEC was shown 
earlier in Figure 5(a). We have increased the data sets of
the NAS benchmarks to make them larger than the avail- 
able memory on our system. Other than increasing the data 
set sizes, we did not modify these applications by hand in 
any way—all prefetch and release operations were inserted 
automatically by our compiler pass. 

Table 2 summarizes the characteristics of these applica- 
tions; each exhibits different data access behavior. EMBAR 
has only one-dimensional loops, while MATVEC has multi- 
dimensional loops with known bounds. For both, the com- 
piler analysis is essentially perfect and excellent results are 
obtained for both the benchmarks themselves and the inter- 
active task. BUK and CGM are more difficult cases, as they 
involve both unknown loop bounds and indirect references, 
both of which reduce the compiler’s ability to analyze the 
data accesses. Nonetheless, the run-time layer is able to 
adapt the behavior based on dynamic conditions and excellent
results are again achieved. MGRID and FFTPDE are
the most difficult cases. Both involve multi-dimensional 
loops with unknown bounds. In MGRID the loop bounds 
change dynamically on different calls to the same proce- 
dures, making it impossible to release memory optimally 
in all cases, since we only generate a single version of the 
code. In FFTPDE, the access stride changes within a set 
of loops, making it seem as though the access is not de- 
pendent on the loop induction variable. This causes the 
compiler to identify some releases as having reuse when 
in fact none exists. Ultimately, the solution to the prob- 
lems experienced by MGRID and FFTPDE is to generate 
more adaptive code, and specialize the loops at run-time 
according to dynamic conditions. Even without this extra 
sophistication, MGRID performs better with releases and 
can significantly reduce (although not eliminate) its nega- 
tive impact on interactive response time. We believe that 
any additional improvements to the results shown here will 
come from improved compiler analysis and code genera- 
tion, and greater run-time layer involvement, rather than 
from additional operating system support. 





4.3 Performance of the Out-of-Core Applications 

The goal of I/O prefetching is to improve the execution 
time of out-of-core applications by hiding the page fault 
latency. The goals of explicitly releasing memory are to 
reduce the number of page faults in out-of-core programs 
by making better replacement decisions, to reduce the in- 
terference caused by the OS selecting victims for replace- 
ment, and to alleviate the impact of out-of-core programs 
on other applications sharing the same system. We begin 
by examining how well our scheme achieves these goals 
from the perspective of the out-of-core applications. 

In Figure 7, we show the execution times of the out-of- 
core programs, normalized to the original case. For each 
benchmark we show four bars: the original, unmodified 
program (O), the program compiled to use prefetching only
(P), the program compiled to use both prefetching and ag- 
gressive releasing (R), and the program compiled to use 
both prefetching and release buffering (B). Each bar is bro- 
ken down into four components. The top section is the time 
Figure 7. Impact of prefetching and releasing on the execution
times of the out-of-core applications. (O = original, P =
with prefetching, R = with prefetching and releasing, B = with
prefetching and release buffering)


that the program was stalled waiting for I/O. The next com- 
ponent is the time that the process was stalled waiting for 
unavailable resources, including physical memory, mem- 
ory system locks, and CPUs. The second-lowest compo- 
nent is the system time, which is primarily spent handling 
page faults. The bottom section of each bar is the time 
spent executing user code. Increases in user time over the 
original case show the overhead of handling prefetch and 
release requests in the run-time layer. Because we use sep- 
arate threads to issue the prefetch requests, the prefetch ser- 
vice time does not appear in the execution time of the main 
application. Since we are using a multiprocessor, many of 
the prefetches can be serviced in parallel. Although the 
prefetch threads compete with the main application and the 
interactive task for CPU time, it is a very small effect since 
these threads spend most of their time waiting for I/O. 


All prefetching versions of the benchmarks achieve sim- 
ilar reductions in the I/O stall time, with over 85% of the 
I/O stall eliminated in all cases. The time spent executing
system code is nearly identical across all versions of the 
benchmarks, and only modest increases in user time oc- 
cur in the prefetching versions. The increase in user time 
is most pronounced for CGM, where a very large num- 
ber of unnecessary prefetch and release requests need to 
be filtered out by the run-time layer. These unnecessary 
requests are the result of the compiler’s inability to rea- 
son about the amount of data accessed in loops with un- 
known bounds. For CGM, most of these loops are small 
and prefetches and releases are not needed. In all cases 
except for FFTPDE and MATVEC, the results for aggres- 
sive releasing and release buffering are very similar, since 
these applications do not have temporal reuse within a sin- 
gle set of loops, and the compiler analysis is unable to de- 
tect reuse across independent sets of loops. When all re- 
lease requests have zero-priority, both implementations of 
the run-time layer perform the same actions (issuing the re- 
quests to the OS without buffering), although the version 
which attempts to buffer requests incurs a small amount of 
additional overhead to check the priorities. In FFTPDE, the 
compiler incorrectly identifies some references as having 
temporal reuse, causing the run-time layer to preferentially 
retain these pages in memory to the detriment of others.
For MATVEC, however, the benefit of buffering and priori- 
tizing releases is dramatic. In this case, without buffering, 
both the matrix and the vector are released, but the vector 
is frequently reused shortly thereafter. Large amounts of 
contention occur between the release daemon attempting to 
free the pages of the vector and the application attempting 
to reclaim them. When the run-time layer buffers and pri- 
oritizes the releases, only the pages of the matrix need to be 
released and contention is greatly reduced. In the remain- 
der of this section, we will discuss both releasing versions 
of the benchmarks together, since their behavior is essen- 
tially the same, making specific reference to MATVEC in 
the cases where buffering makes a difference. 

The I/O stall reductions, and the system time and user 
time components of these experiments all validate the re- 
sults we obtained in our previous study on compiler-based 
I/O prefetching [15], demonstrating that these techniques 
are still applicable with modern hardware and software. 
Our prior study, however, showed that releasing memory 
provided no significant benefit to the out-of-core applica- 
tions over prefetching alone. One key difference here is 
that the earlier compiler implementation did not insert re- 
lease operations in many situations. Our results here, in 
contrast, show that there is a substantial reduction in the 
execution time of the out-of-core applications when releas- 
ing is applied aggressively. The speedups from applying 
both prefetching and releasing over prefetching alone range 
from 13% for EMBAR to over 50% for CGM. This added 
benefit is rather unexpected, both because it did not occur 
in the previous study, and because the run-time layer imple- 
mentations are not trying to actively improve the replace- 
ment policy (since there is no known reuse)—they simply 
try to maintain as large a pool of free memory as possi- 
ble by releasing pages which the application apparently no 
longer needs. There are essentially three reasons for the 
improvement due to aggressive releasing: (i) a reduction in 
the number of soft page faults caused by the paging daemon 
attempting to identify unused pages; (ii) a reduction in the 
contention for memory locks needed by both the fault han- 
dling code and the paging daemon; and (iii) improvements 
in the replacement policy created by the compiler analysis 
alone. We now discuss the impact of each of these effects. 

Looking at the components of the bars in Figure 7, we 
see that the greatest difference between the prefetching- 
only and the two prefetching-and-releasing cases is in the 
time stalled for unavailable resources. Without releasing, 
the paging daemon needs to determine which pages should 
be reclaimed. To do so, a variant of a clock algorithm is 
used, in which pages can be reclaimed if they have not been 
referenced for a number of passes of the clock hand. Since 
the MIPS TLB does not have reference bits, reference in- 
formation must be simulated in software using the valid bit 
instead. As free memory becomes low, pages are period- 
ically marked invalid to see if they are still in use. These 
invalidations increase the number of soft page faults as the 
process references, and needs to re-validate, the pages that 
Table 3. Pages freed by system or by release, and pages rescued from the free list.
Figure 8. Soft page faults due to page invalidations.


were still in its working set. However, with aggressive re- 
leasing, the paging daemon does not need to find pages to 
reclaim, thus greatly reducing the number of invalidations. 
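
To make the source of the extra soft page faults concrete, the following toy model simulates reference bits with a valid bit, in the spirit of the clock variant described above. It is an illustrative sketch only and does not reproduce the IRIX paging daemon: the class name SoftRefClock, the max_age threshold, and the choice to invalidate every valid page on each sweep are assumptions made for this example.

```python
class SoftRefClock:
    """Toy clock sweep that uses a valid bit in place of a hardware reference bit."""

    def __init__(self, pages, max_age=2):
        self.pages = list(pages)
        self.valid = {p: True for p in self.pages}
        self.age = {p: 0 for p in self.pages}  # sweeps since the page was last seen valid
        self.max_age = max_age                 # assumed number of passes before reclaim
        self.soft_faults = 0

    def touch(self, page):
        """The application references a resident page."""
        if not self.valid[page]:
            # The page is still in memory, so this is a soft fault:
            # no I/O is needed, the mapping is simply re-validated.
            self.soft_faults += 1
            self.valid[page] = True

    def sweep(self):
        """One pass of the clock hand; returns pages that look unused."""
        candidates = []
        for p in self.pages:
            if self.valid[p]:
                # Possibly referenced since the last pass: invalidate it
                # so that the next reference becomes visible as a fault.
                self.valid[p] = False
                self.age[p] = 0
            else:
                self.age[p] += 1
                if self.age[p] >= self.max_age:
                    candidates.append(p)  # a real pager would move these to the free list
        return candidates
```

In this model every page the application keeps using costs one soft fault per sweep, which is the overhead that disappears when explicit releases keep the daemon from sweeping at all.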

Figure 8 shows the number of page faults caused by these 
periodic invalidations for each version of our out-of-core 
benchmarks. Not only is the number of soft page faults
greater when prefetching is used without releasing, but the time
to service each of these faults is also amplified due to in- 
creased contention for locks between the paging daemon 
and the fault handling code. The time to handle hard page 
faults is also increased by this contention. When the paging 
daemon needs to invalidate or reclaim pages, it holds locks 
on the address spaces of the processes from which pages 
are being stolen. During this time, page faults for these vir- 
tual memory regions cannot be serviced. The releasing dae- 
mon must hold the same locks while freeing the explicitly 
released pages; however, it typically operates on smaller 
blocks of pages, so the locks can be held for much shorter 
periods of time. Furthermore, the releasing daemon has 
been specialized for the purpose of freeing pre-identified 
pages. Thus, it requires fewer locks overall and can do 
much less processing per page while locks are held. The 
resulting lock contention caused by the releasing daemon 
is significantly less than that caused by the paging daemon. 

Finally, in some cases the compiler analysis is able to 
improve upon the replacement policy without extra sup- 
port from the run-time layer. In BUK, the data set consists 
of two very large sequentially-accessed arrays and a third 
equally large randomly-accessed array. The compiler in- 
serts releases for the first two, but does not try to release 
the third because it cannot reason about any locality that 
may exist. The result is that demand for new pages is satis- 
fied by the releases of the first two arrays and the pages of 
the third array are able to remain mostly in memory. Without
releasing, the paging daemon reclaims pages from all
three arrays according to their last use, but without regard 
to their access patterns, causing many more page faults to 
occur. Although the run-time layer is not able to prioritize 
releases due to a lack of temporal reuse, the decision by the 
compiler to not release randomly accessed data effectively 
accomplishes the desired effect. Having discussed the over- 
all performance impact of our system, we now take a closer 
look at how effective the compiler and run-time layer are at 
generating and managing releases. 


4.4 Effectiveness of Releases 

There are two considerations when evaluating the effec- 
tiveness of the release operation. First, the purpose of is- 
suing releases is to maintain a large enough pool of free 
memory to prevent the default page reclamation behavior. 
To see how well we achieve this goal, we look at how much 
work the paging daemon performs, both with and without 
releases. Second, we should only be releasing pages that 
are really no longer in use by the application (or will not 
be used again for a long time) to avoid increasing the page 
fault rate. To see how useful the releases are, we look at 
how many released pages are “rescued” from the free list 
(i.e. returned to the process that was using them). If we are actually
releasing pages that are no longer needed, very few
pages should be rescued. The page reclamation and alloca- 
tion activity is summarized in Table 3 for the original out- 
of-core programs and the versions that both prefetch and 
release memory without buffering. 

From Table 3, we see that releases are usually very effective
at reducing the need for the paging daemon to reclaim
memory. In the worst case, the number of times that the 
paging daemon needs to operate is reduced by more than 
half, and the total number of pages stolen is reduced by 
more than a factor of three. In the other cases, the activ- 
ity of the paging daemon is reduced by one to two orders 
of magnitude, both in terms of frequency and number of 
pages stolen. Although it is very difficult for the applica- 
tion to release its pages perfectly, it can still provide a great 
deal of assistance to the OS. 

Next we look at how often useful pages are reclaimed 
too early, either by the paging daemon or due to explicit 
release requests. There are two possibilities. First, useful 
pages may still be on the free list when they are referenced 
again, and can be rescued and returned to the application. 
Second, useful pages may have been re-allocated to hold 
other data before being referenced again, and the reused 
Figure 9. Breakdown of outcomes for freed pages.


data will need to be brought back into memory from swap. 


Figure 9 shows what fraction of all the pages freed are 
freed by the paging daemon vs. the fraction freed explicitly 
by release requests. We also show the fraction of each that 
are rescued from the free list. The interesting cases here are 
BUK, MGRID and MATVEC. As we see in Figure 9, BUK 
without any releasing (both the original and prefetching 
versions) frequently needs to rescue the pages reclaimed 
by the paging daemon from the free list. The greater de- 
mand on memory introduced by prefetching increases the 
need for the paging daemon to reclaim memory, resulting 
in useful pages being placed on the free list more often. 
Consequently, the fraction of reclaimed pages that are res- 
cued also increases. With releasing, however, most of the 
pages are freed by explicit release requests and very few are 
rescued from the free list. In this case, releasing helps the 
application to retain its most-needed pages in memory. For 
MGRID, we see that even with releasing, over half of the 
pages freed are reclaimed by the paging daemon, and that 
more than half of the pages explicitly released are rescued 
from the free list. This suggests that the compiler is unable 
to determine which pages to release and when for MGRID. 
Note also that FFTPDE with release buffering performs very
few useful releases due to incorrectly attempting to retain 
pages with no reuse. For MATVEC without releasing, the 
OS does a reasonable job of freeing the pages of the ma- 
trix and keeping the frequently accessed vector in memory. 
With aggressive releasing, however, approximately half of 
the pages released are for the vector and need to be rescued 
from the free list. When release buffering is used, most of 
the released pages are for the matrix, and the number of 
rescued pages is much smaller. Overall, we can see that re- 
leasing greatly reduces the need for the paging daemon to 
reclaim memory, and typically does a good job of releasing 
pages that are no longer in use. 
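
The quantities discussed for Figure 9 reduce to a few ratios, sketched below. The function name freed_page_breakdown and its parameter names are hypothetical, and any numbers passed to it in an example are made up for illustration; they are not taken from Table 3 or Figure 9.

```python
def freed_page_breakdown(daemon_freed, daemon_rescued, released, release_rescued):
    """Fractions of freed pages by source, and the rescue rate for each source."""
    total = daemon_freed + released

    def ratio(numerator, denominator):
        # Guard against an empty category rather than dividing by zero.
        return numerator / denominator if denominator else 0.0

    return {
        "daemon_fraction": ratio(daemon_freed, total),
        "release_fraction": ratio(released, total),
        "daemon_rescue_rate": ratio(daemon_rescued, daemon_freed),
        "release_rescue_rate": ratio(release_rescued, released),
    }
```

A low release_rescue_rate indicates that the released pages were genuinely no longer needed, while a high value (as the text reports for MGRID and for MATVEC with aggressive releasing) indicates that useful pages are being released too early.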


Detecting pages that were freed too early and re- 
allocated before they could be rescued is a more difficult 
task. These pages will increase the total number of page al- 
locations required (over the ideal) as new pages are needed 
to bring the reused data back into memory. While we can- 
not compare the total number of page allocations to the 
ideal number, we can look at the number of allocations 
(c) Average number of page faults requiring I/O for the interactive
task with each out-of-core benchmark.


Figure 10. Impact of releasing on interactive response time.


in the original case versus the prefetching-and-releasing 
cases. From Table 3, we see that the total number of page 
allocations increases by a small amount with prefetching 
and releasing in half of the cases, and decreases by a small 
amount in the other half. This suggests that releasing is 
typically doing no worse at freeing needed pages than the 
paging daemon, but results in much less contention. 

We now look at how useful releases are for improving 
the performance of the interactive task. 


4.5 Impact on Interactive Response Time

Figure 10 gives an overview of the performance im- 
provements obtained for the “interactive” task. In Fig- 
ure 10(a), we show the average response time for the in- 
teractive task when executed concurrently with MATVEC 
across a range of sleep times. As discussed in Section 1.1, 
the response times become greatly inflated when the out-of-core
program executes normally, and are made even worse
when prefetching alone is used. When releasing is added to 
prefetching, however, the response times of the interactive
task almost perfectly match the times obtained when it is
run alone on the machine, regardless of the amount of sleep 
time. Although blindly following the release directives in- 
serted by the compiler has a severe effect on MATVEC’s 
own performance, this strategy does leave most of mem- 
ory free for the interactive task. However, when release 
buffering is used to improve the performance of MATVEC, 
there is still nearly no impact on the interactive task. The 
run-time layer is able to both buffer releases for the benefit
of the out-of-core task and keep enough memory free
for the interactive one. The negative impact of the out-of- 
core program on the response time of the interactive task 
in this case has been almost completely eliminated. For 
the other out-of-core applications, we chose an intermedi- 
ate sleep time of five seconds for the interactive task and 
recorded the average response times. The results for each 
of the four versions of the out-of-core programs are shown 
in Figure 10(b). The response times in this graph have been
normalized to the time for the interactive task executing
alone on the machine. As we see in Figure 10(b), releasing
is usually successful at eliminating or substantially reduc- 
ing the degradation in interactive response time. FFTPDE 
with release buffering is the exception as this benchmark 
fails to release enough memory. 

Figure 10(c) shows the average number of hard page 
faults (i.e. those that require I/O) experienced by the in- 
teractive task during a single sweep through its data set, 
when it is executed concurrently with each version of our 
out-of-core benchmarks. From this table, we see that the 
number of page faults increases when the out-of-core pro- 
gram uses prefetching alone, rising to the maximum level 
of 65 pages. At this point, the entire data set of the inter- 
active task must be paged in from the swap space. When 
the out-of-core program also releases pages, the number of 
hard page faults is significantly reduced. This result ver- 
ifies that the primary reason for the increased interactive 
response time is not being able to keep pages in memory. 


5 Related Work 


Many researchers have suggested that better perfor- 
mance can be obtained if sophisticated applications are 
given control over their own memory management deci- 
sions. Most previous work in this area has focused on 
how the OS can provide this functionality to the applica- 
tions. For instance, the Mach operating system supports 
external pagers to allow applications to control the back- 
ing storage of their memory objects [18]. Extensions to the 
external pager interface have been used to implement user- 
level page replacement policies [14] and to support discard- 
able pages (i.e. dirty pages that do not have to be written to 
backing store) [20]. More aggressive application control 
of physical memory was implemented in the V++ kernel 
by Harty and Cheriton [10]. In their scheme, the application
was given complete control over a cache of physical
pages, enabling the implementation of application-specific
memory management policies. Giving applications more 
control over physical resources (not just memory) is also 
a part of the motivation behind extensible operating sys- 
tems such as Exokernel [12], SPIN [2], and Vino [19]. Pro- 
viding support for application-specific control is only half 
of the picture, however. If the mechanisms provided re- 
quire programmers to re-write their applications manually, 
the full power of the scheme is unlikely to be realized in 
the real world. In contrast, our approach provides not only 
the mechanisms for application-controlled memory man- 
agement, but also a means to leverage these mechanisms 
automatically through the use of the compiler. 

Other related work has shown the importance of consid- 
ering both prefetching and replacement decisions in tan- 
dem, in the context of I/O prefetching for file system ref- 
erences. Cao et al. [3] present several properties that op- 
timal prefetching and caching strategies must have; how- 
ever the complete reference stream is required to satisfy 
these properties. The TIP system for I/O prefetching by 
Patterson et al. [16] uses a cost-benefit model to estimate 
which file blocks should be replaced from the buffer cache, 
based on access-pattern hints disclosed by the application. 
While the goal of using application-specific knowledge to 
improve overall system performance is the same as in our 
system, we focus on virtual memory references rather than 
file reads and writes. In the original TIP implementation, 
applications had to be manually modified to generate the 
necessary access hints. Recently, another approach for au- 
tomatically modifying applications to provide hints about 
their future accesses has been presented by Chang and Gib- 
son [4]. Applications are modified automatically (using 
a binary modification tool on the program executable) to 
speculatively execute the code and generate access pattern 
hints to be passed to the TIP system. Because it is much 
more costly to track all virtual memory references (versus 
explicit file requests only) the techniques used by the TIP 
system for deciding what to eject from the file cache are 
not especially applicable for virtual memory management. 


6 Conclusions 


We have implemented and evaluated a complete and 
fully-automatic system which exploits compiler-inserted 
release operations to intelligently manage the physical 
memory resources of out-of-core applications. These spe- 
cialized applications can reduce their impact on the per- 
formance of other applications while still exploiting ag- 
gressive prefetching to hide their I/O latency. Our results 
confirm that compiler-inserted I/O prefetching works well 
on commercial operating systems and state-of-the-art ma- 
chines (even though faster processors make it much more 
challenging to hide the I/O latency), hiding roughly 85- 
100% of the I/O stall time in our out-of-core benchmarks 
and achieving good overall speedups. 

The significant benefit to the out-of-core benchmarks 
due to aggressively releasing memory was mostly unexpected.
In BUK we expected to see a benefit from improving
on the replacement policy, but for the other applications
(excepting MATVEC, which is hurt by aggressive releas- 
ing), the improvement comes from reducing the interfer- 
ence between the operating system and the application. We 
found the extent of this interference between the paging 
daemon and the page fault handling to be especially sur- 
prising. Not only does the paging daemon greatly increase 
the number of soft page faults as it attempts to simulate 
reference bits in software, but the time to handle these page 
faults is also inflated by increased lock contention. Because 
the overhead of determining which pages to replace is so 
large, explicit replacement hints can improve performance, 
even if they are not making better replacement decisions 
than the default policy. It would be interesting to see if 
these benefits still occur on a system with hardware refer- 
ence bits (although such a study was beyond the scope of 
this paper since IRIX only runs on MIPS processors). 

Overall, our compiler-based approach for combining 
both prefetching and releasing to allow out-of-core applications
to explicitly manage their virtual memory is a situation
in which everyone wins. Both the memory-intensive
programs and the less demanding interactive ones sharing 
the system obtain performance benefits. Only the out-of- 
core programs need to be modified, and the changes are 
performed automatically by the compiler without burden- 
ing the application programmer. Furthermore, the default 
policies of the operating system do not need to be changed, 
and no overhead is introduced in the common case for man- 
aging ordinary applications. 
Abstract


In this paper, we present surplus fair scheduling (SFS), 
a proportional-share CPU scheduler designed for sym- 
metric multiprocessors. We first show that the infeasibil- 
ity of certain weight assignments in multiprocessor envi- 
ronments results in unfairness or starvation in many ex- 
isting proportional-share schedulers. We present a novel 
weight readjustment algorithm to translate infeasible 
weight assignments to a set of feasible weights. We show 
that weight readjustment enables existing proportional-share
schedulers to significantly reduce, but not eliminate,
the unfairness in their allocations. We then present
surplus fair scheduling, a proportional-share scheduler 
that is designed explicitly for multiprocessor environ- 
ments. We implement our scheduler in the Linux ker- 
nel and demonstrate its efficacy through an experimen- 
tal evaluation. Our results show that SFS can achieve 
proportionate allocation, application isolation and good 
interactive performance, albeit at a slight increase in 
scheduling overhead. We conclude from our results that 
a proportional-share scheduler such as SFS is not only 
practical but also desirable for server operating sys- 
tems. 


1 Introduction 


1.1 Motivation


The growing popularity of multimedia and web applica- 
tions has spurred research in the design of large multi- 
processor servers that can run a variety of demanding ap- 
plications. To illustrate, many commercial web sites to- 
day employ multiprocessor servers to run a mix of HTTP 
applications (to service web requests), database applica- 
tions (to store product and customer information), and 
streaming media applications (to deliver audio and video 
content). Moreover, Internet service providers that host
third party web sites typically do so by mapping mul- 
tiple web domains onto a single physical server, with 
each domain running a mix of these applications. These 
example scenarios illustrate the need for designing re- 
source management mechanisms that multiplex server 
resources among diverse applications in a predictable 
manner. 

Resource management mechanisms employed by a 
server operating system should have several desirable 
properties. First, these mechanisms should allow users 
to specify the fraction of the resource that should be al- 
located to each application. In the web hosting exam- 
ple, for instance, it should be possible to allocate a cer- 
tain fraction of the processor and network bandwidth to 
each web domain [2]. The operating system should then 
allocate resources to applications based on these user- 
specified shares. It has been argued that such allocation 
should be both fine-grained and fair [3, 9, 15, 17, 20]. 
Another desirable property is application isolation—the 
resource management mechanisms employed by an op- 
erating system should effectively isolate applications 
from one another so that misbehaving or overloaded ap- 
plications do not prevent other applications from receiv- 
ing their specified shares. Finally, these mechanisms 
should be computationally efficient so as to minimize 
scheduling overheads. Thus, efficient, predictable and 
fair allocation of resources is key to designing server op- 
erating systems. The design of a CPU scheduling algo- 
rithm for symmetric multiprocessor servers that meets 
these objectives is the subject matter of this paper. 


1.2 Relation to Previous Work 


In the recent past, a number of resource management
mechanisms have been developed for predictable allocation
of processor bandwidth [2, 7, 11, 12, 14, 16, 18, 24, 28].
Many of these CPU scheduling mechanisms, as well as
their counterparts in the network packet scheduling
domain [4, 5, 19, 23], associate an intrinsic rate with
each application and allocate resource bandwidth in
proportion to this rate. For instance, many recently
proposed algorithms such as start-time fair queuing (SFQ)
[9], borrowed virtual time (BVT) [7], and SMART [16]
are based on the concept of generalized processor shar- 
ing (GPS). GPS is an idealized algorithm that assigns 
a weight to each application and allocates bandwidth 
fairly to applications in proportion to their weights. GPS 
assumes that threads can be scheduled using infinitesi- 
mally small quanta to achieve weighted fairness. Practi- 
cal instantiations, such as SFQ, emulate GPS using finite 
duration quanta. While GPS-based algorithms can pro- 
vide strong fairness guarantees in uniprocessor environ- 
ments, they can result in unbounded unfairness or star- 
vation when employed in multiprocessor environments 
as illustrated by the following example. 


Example 1 Consider a server that employs the start-time
fair queuing (SFQ) algorithm [9] to schedule threads.
SFQ is a GPS-based fair scheduling algorithm that
assigns a weight $w_i$ to each thread and allocates
bandwidth in proportion to these weights. To do so, SFQ
maintains a counter $S_i$ for each thread that is
incremented by $q / w_i$ every time the thread is scheduled
($q$ is the quantum duration). At each scheduling instance,
SFQ schedules the thread with the minimum $S_i$ on a
processor. Assume that the server has two processors and
runs two compute-bound threads that are assigned weights
$w_1 = 1$ and $w_2 = 10$, respectively. Let the quantum
duration be $q = 1$ ms. Since both threads are compute-bound
and SFQ is work-conserving,¹ each thread gets to
run continuously on a processor. After 1000 quantums,
we have $S_1 = \frac{1000}{1} = 1000$ and $S_2 = \frac{1000}{10} = 100$.
Assume that a third CPU-bound thread arrives at this
instant with a weight $w_3 = 1$. The counter for this
thread is initialized to $S_3 = 100$ (newly arriving threads
are assigned the minimum value of $S_i$ over all runnable
threads). From this point on, threads 2 and 3 get
continuously scheduled until $S_2$ and $S_3$ “catch up” with $S_1$.
Thus, although thread 1 has the same weight as thread
3, it starves for 900 quanta, leading to unfairness in the
scheduling algorithm. Figure 1 depicts this scenario.
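
The starvation in Example 1 can be reproduced with a few lines of simulation. The following is a minimal sketch rather than the authors' code: the function names `sfq_pick` and `simulate_example1` are hypothetical, quanta are assumed to be synchronized across both processors, and ties between equal counters are assumed to be broken by thread id. Exact rational arithmetic is used so that the counters match the values quoted in the example.

```python
from fractions import Fraction


def sfq_pick(tags, p):
    """Return the ids of the p threads with the smallest counters S_i.

    Ties are broken by thread id (an assumption; SFQ leaves this open).
    """
    return sorted(tags, key=lambda tid: (tags[tid], tid))[:p]


def simulate_example1(p=2, q=1, warmup=1000):
    """Replay Example 1 and count how long thread 1 is starved."""
    weights = {1: Fraction(1), 2: Fraction(10)}
    tags = {1: Fraction(0), 2: Fraction(0)}

    # Phase 1: two compute-bound threads on two processors.
    for _ in range(warmup):
        for tid in sfq_pick(tags, p):
            tags[tid] += Fraction(q) / weights[tid]
    after_warmup = dict(tags)

    # Phase 2: thread 3 arrives and inherits the minimum counter.
    weights[3] = Fraction(1)
    tags[3] = min(tags.values())

    starved = 0
    while 1 not in sfq_pick(tags, p):
        for tid in sfq_pick(tags, p):
            tags[tid] += Fraction(q) / weights[tid]
        starved += 1
    return after_warmup, starved


if __name__ == "__main__":
    counters, starved = simulate_example1()
    print("counters after warm-up:", counters)
    print("quanta during which thread 1 did not run:", starved)
```

Under these assumptions the counters after the warm-up phase are 1000 and 100, and thread 1 is not scheduled for 900 consecutive quanta, which matches the numbers in the example.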


Many recently proposed GPS-based algorithms such as 
stride scheduling [28], weighted fair queuing (WFQ) 
[18] and borrowed virtual time (BVT) [7] also suffer 
from this drawback when employed for multiprocessors 
(like SFQ, stride scheduling and WFQ are instantiations 
of GPS, while BVT is a derivative of SFQ with an ad- 
ditional latency parameter; BVT reduces to SFQ when 
the latency parameter is set to zero). The primary reason 
for this inadequacy is that while any arbitrary weight 
assignment is feasible for uniprocessors, only certain 


¹ A scheduling algorithm is said to be work-conserving if it never 
lets a processor idle so long as there are runnable threads in the system. 
weight assignments are feasible for multiprocessors. In 
particular, those weight assignments in which the band- 
width assigned to a single thread exceeds the capacity 
of a processor are infeasible (since an individual thread 
cannot consume more than the bandwidth of a single 
processor). In the above example, the second thread was
assigned 10/11 of the total bandwidth on a dual-processor
server, whereas it can consume no more than half the to- 
tal bandwidth. Since GPS-based work-conserving algo- 
rithms do not distinguish between feasible and infeasible 
weight assignments, unfairness can result when a weight 
assignment is infeasible. In fact, even when the ini- 
tial weights are carefully chosen to be feasible, blocking 
events can cause the weights of the remaining threads to 
become infeasible. For instance, a feasible weight as- 
signment of 1:1:2 on a dual-processor server becomes 
infeasible when one of the threads with weight 1 blocks. 
Even when all weights are feasible, an orthogonal prob- 
lem occurs when frequent arrivals and departures pre- 
vent a GPS-based scheduler such as SFQ from achieving 
proportionate allocation. Consider the following exam- 
ple: 


Example 2 Consider a dual-processor server that
runs a thread with weight 10,000 and 10,000 threads
with weight 1. Assume that short-lived threads with
weight 100 arrive every 100 quantums and run for 100
quantums each. Note that the weight assignment is always
feasible. If SFQ is used to schedule these threads,
then it will assign the current minimum value of $S_i$ in
the system to each newly arriving thread. Hence, each
short-lived thread is initialized with the lowest value of
$S_i$ and gets to run continuously on a processor until it
departs. The thread with weight 10,000 runs on the other
processor; all threads with weight 1 run infrequently.
Thus, each short-lived thread with weight 100 gets as
much processor bandwidth as the thread with weight
10,000 (instead of 1/100th of the bandwidth). Note that this
problem does not occur in uniprocessor environments.
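
The feasibility claim in Example 2 can be checked by hand. The worked calculation below is a sketch that assumes exactly one short-lived thread is present at a time (the example implies this but does not state it) and uses the condition that no thread may request more than a fraction $1/p$ of the total bandwidth, with $p = 2$. The subscripts "heavy" and "short" are labels introduced here for illustration.

```latex
\frac{w_{\text{heavy}}}{\sum_j w_j}
  = \frac{10000}{10000 + 10000 \cdot 1 + 100}
  = \frac{10000}{20100} \approx 0.4975 \le \frac{1}{2},
\qquad
\frac{w_{\text{short}}}{w_{\text{heavy}}}
  = \frac{100}{10000} = \frac{1}{100}.
```

The largest weight therefore stays just inside the feasible region, while a proportionate allocation would give each short-lived thread one hundredth of the heavy thread's bandwidth rather than an equal share of a processor.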


The inability to distinguish between feasible and in- 
feasible weight assignments as well as to achieve pro- 
portionate allocation in the presence of frequent ar- 
rivals and departures are fundamental limitations of a 
proportional-share scheduler such as SFQ. Whereas ran- 
domized schedulers such as lottery scheduling [27] do 
not suffer from starvation problems due to infeasible 
weights, such weight assignments can, nevertheless, 
cause small inaccuracies in proportionate allocation for 
such schedulers. Several techniques can be employed 
to address the problem of infeasible bandwidth assign- 
ments. In the simplest case, processor bandwidth could 
be assigned to applications in absolute terms instead of 
using a relative mechanism such as weights (e.g., assign 
20% of the bandwidth on a processor to a thread). A 
Figure 1: The Infeasible Weights Problem: an infeasible weight assignment can lead to unfairness in allocated shares in multiprocessor environments. 


potential limitation of such absolute allocations is that 
bandwidth unused by an application is wasted, resulting 
in poor resource utilization. To overcome this drawback, 
most modern schedulers that employ this method reallo- 
cate unused processor bandwidth to needy applications 
in a fair-share manner [10, 14]. In fact, it has been shown 
that relative allocations using weights and absolute allo- 
cations with fine-grained reassignment of unused band- 
width are duals of each other [22]. A more promising ap- 
proach is to employ a GPS-based scheduler for each pro- 
cessor and partition the set of threads among processors 
such that each processor is load balanced. Such an ap- 
proach has two advantages: (i) it can provide strong fair- 
ness guarantees on a per-processor basis, and (ii) binding 
a thread to a processor allows the scheduler to exploit 
processor cache locality. A limitation of the approach is 
that periodic repartitioning of threads may be necessary 
since blocked/terminated threads can cause imbalances 
across processors, which can be expensive. Neverthe- 
less, such an approach has been successfully employed 
to isolate applications from one another [1, 8, 26]. 

In summary, GPS-based fair scheduling algorithms or 
simple modifications thereof are unsuitable for fair allo- 
cation of resources in multiprocessor environments. To 
overcome this limitation, we propose a CPU scheduling 
algorithm for multiprocessors that: (i) explicitly distin- 
guishes between feasible and infeasible weight assign- 
ments and (ii) achieves proportionate allocation of pro- 
cessor bandwidth to applications. 


1.3 Research Contributions of this Paper 


In this paper, we present surplus fair scheduling (SFS), 
a predictable CPU scheduling algorithm for symmetric 
multiprocessors. The design of this algorithm has led to 
several key contributions. 

First, we have developed a weight readjustment algorithm
to explicitly deal with the problem of infeasible
weight assignments; our algorithm translates a set
of infeasible weights to the “closest” feasible weight
assignment, thereby enabling all scheduling decisions to
be based on feasible weights. Our weight readjustment 
algorithm is a novel approach for dealing with infeasible 
weights and one that can be combined with most exist- 
ing GPS-based scheduling algorithms; doing so enables 
these algorithms to vastly reduce the unfairness in their 
allocations for multiprocessor environments. However, 
even with the readjustment algorithm, many GPS-based 
algorithms show unfairness in their allocations, especially
in the presence of frequent arrivals and departures
of threads. To overcome this drawback, we develop the 
surplus fair scheduling algorithm for proportionate allo- 
cation of bandwidth in multiprocessor environments. A 
key feature of our algorithm is that it does not require 
the quantum length to be known a priori, and hence can 
handle quantums of variable length. 

We have implemented the surplus fair scheduling al- 
gorithm in the Linux kernel and have made the source 
code available to the research community. We have ex- 
perimentally demonstrated the benefits of our algorithm 
over a GPS-based scheduler such as SFQ using sam- 
ple applications and benchmarks. Our experimental re- 
sults show that surplus fair scheduling can achieve pro- 
portionate allocation, application isolation and good in- 
teractive performance for typical application mixes, al- 
beit at the expense of a slight increase in the schedul- 
ing overhead. Together these results demonstrate that 
a proportional-share CPU scheduling algorithm such as 
surplus fair scheduling is not only practical but also de- 
sirable for server operating systems. 

The rest of this paper is structured as follows. Section 
2 presents the surplus fair scheduling algorithm. Sec- 
tion 3 discusses the implementation of our scheduling 
algorithm in Linux. Section 4 presents the results of our 
experimental evaluation. Section 5 presents some limi- 
tations of our approach and directions for future work. 
Finally, Section 6 presents some concluding remarks. 


2 Proportional-Share CPU Scheduling 
for Multiprocessor Environments 


Consider a multiprocessor server with $p$ processors that
runs $t$ threads. Let us assume that a user can assign any
arbitrary weight to a thread. In such a scenario, a thread
with weight $w_i$ should be allocated a fraction
$w_i / \sum_j w_j$ of the total processor bandwidth. Since weights
can be arbitrary, it is possible that a thread may request
more bandwidth than it can consume (this occurs
when the requested fraction $\frac{w_i}{\sum_j w_j} > \frac{1}{p}$). The CPU
scheduler must somehow reconcile the presence of such
infeasible weights. To do so, we present an optimal 
weight readjustment algorithm that can efficiently trans- 
late a set of infeasible weights to the “closest” feasi- 
ble weight assignment. By running this algorithm ev- 
ery time the weight assignment becomes infeasible, the 
CPU scheduler can ensure that all scheduling decisions 
are always based on a set of feasible weights. Given such 
a weight readjustment algorithm, we then present gener- 
alized multiprocessor sharing (GMS)—an idealized al- 
gorithm for fair, proportionate bandwidth allocation that 
is an analogue of GPS in the multiprocessor domain. We 
use the insights provided by GMS to design the surplus 
fair scheduling (SFS) algorithm. SFS is a practical in- 
stantiation of GMS that has lower implementation over- 
heads. 

In what follows, we first present our weight readjust- 
ment algorithm in Section 2.1. We present generalized 
multiprocessor sharing in Section 2.2 and then present 
the surplus fair scheduling algorithm in Section 2.3. 


2.1 Efficient, Optimal Weight Readjustment 


As illustrated in Section 1.2, weight assignments in 
which a thread requests a bandwidth share that ex- 
ceeds the capacity of a processor are infeasible. More- 
over, a feasible weight assignment may become infeasi- 
ble or vice versa whenever a thread blocks or becomes 
runnable. To address these problems, we have developed 
a weight readjustment algorithm that is invoked every 
time a thread blocks or becomes runnable. The algo- 
rithm examines the set of runnable threads to determine 
if the weight assignment is feasible. A weight assigned 
to a thread is said to be feasible if 

$$\frac{w_i}{\sum_j w_j} \le \frac{1}{p} \qquad (1)$$

We refer to Equation 1 as the feasibility constraint. If
a thread violates the feasibility constraint (i.e., requests 
a fraction that exceeds 1/p), then it is assigned a new 
weight so that its requested share reduces to 1/p (which 
is the maximum share an individual thread can con- 
sume). Doing so for each thread with infeasible weight 
ensures that the new weight assignment is feasible. 
Conceptually, the weight readjustment algorithm pro- 
ceeds by examining each thread in descending order of 
weights to see if it violates the feasibility constraint. 
Each thread that does so is assigned the bandwidth 
of an entire processor, which is the maximum band- 
width a thread can consume. The problem then re- 
duces to checking the feasibility of scheduling the re- 
maining threads on the remaining processors. In prac- 
tice, the readjustment algorithm is implemented us- 
ing recursion—the algorithm recursively examines each 
thread to see if it violates the constraint; the recursion 
terminates when a thread that satisfies the constraint is 
found. The algorithm then assigns a new weight to each 
thread that violates the constraint such that its requested 
fraction equals 1/p. This is achieved by computing the 
average weight of all feasible threads over the remaining
processors and assigning it to the current thread (i.e.,
$w_i = \frac{\sum_{j=i+1}^{t} w_j}{p - 1}$). Figure 2 illustrates the complete
weight readjustment algorithm.

Our weight readjustment algorithm has the following
salient features:


• The algorithm is optimal in the sense that it changes 
the weights of the minimum number of threads 
and the new weights are the “closest” weights that 
reflect the original assignment. This is because 
threads with infeasible weights are assigned the 
nearest feasible weight, and weights of threads that 
satisfy the feasibility constraint never change (and 
hence, they continue to receive bandwidth in their 
requested proportions). 


• The algorithm has an efficient implementation. To
see why, observe that in a $p$-processor system,
no more than $(p - 1)$ threads can have infeasible
weights (since the sum of the requested fractions is
1, no more than $(p - 1)$ threads can request a fraction
that exceeds $1/p$). Thus, the number of threads
with infeasible weights depends solely on the number
of processors and is independent of the total
number of threads in the system. By maintaining
a list of threads sorted in descending order of
their weights, the algorithm needs to examine no
more than the first $(p - 1)$ threads with the largest
weights. In fact, the algorithm can stop scanning 
the sorted list at the first point where the feasibility 
constraint is satisfied (subsequent threads have even 
smaller weights and hence, request smaller and fea- 
sible fractions). Since the number of processors is 
typically much smaller than the number of threads 
readjust(array w[1..t], int i, int p)
begin
    if ( w[i] / sum(w[i..t]) > 1/p )
    begin
        readjust(w[1..t], i + 1, p - 1)
        w[i] = sum(w[i+1..t]) / (p - 1)
    end
end.


Figure 2: The weight readjustment algorithm: The al- 
gorithm is invoked with an array of weights sorted in 
decreasing order. Initially, i = 1; p denotes the num- 
ber of processors, and t denotes the number of runnable 
threads. If a thread violates the feasibility constraint, 
then the algorithm is recursively invoked for the remain- 
ing threads and the remaining processors. Each infeasi- 
ble weight is then adjusted by setting its requested pro- 
cessor share to 1/p. 


(p << t), the overhead imposed by the readjust- 
ment algorithm is small. 


• Our weight readjustment algorithm can be employed
with most existing GPS-based scheduling
algorithms to deal with the problem of infeasible 
weights. We experimentally demonstrate in Section 
4.2 that doing so enables these schedulers to signif- 
icantly reduce (but not eliminate) the unfairness in 
their allocations for multiprocessor environments. 


• The weight readjustment algorithm can also 
be employed in conjunction with a random- 
ized proportional-share scheduler such as lottery 
scheduling [27]. Although such a scheduler does 
not suffer from starvation problems due to infeasi- 
ble weights, a set of feasible weights can help such 
a randomized scheduler in making more accurate 
scheduling decisions. 
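
The recursion of Figure 2 can be written out as runnable code. The sketch below follows the figure step by step but is not the authors' implementation: the names `is_feasible` and `readjust`, the use of zero-based indices, and the explicit check that there are at least as many runnable threads as processors are assumptions made here so that the example is self-contained.

```python
from fractions import Fraction


def is_feasible(weights, p):
    """Equation 1: every requested fraction w_i / sum(w) is at most 1/p."""
    total = sum(weights)
    return all(w * p <= total for w in weights)


def readjust(weights, p):
    """Return readjusted weights for `weights` sorted in descending order.

    Assumes at least p runnable threads (an assumption of this sketch;
    with fewer threads every thread can simply own a processor).
    """
    if len(weights) < p:
        raise ValueError("sketch assumes at least p runnable threads")
    w = [Fraction(x) for x in weights]
    _readjust(w, 0, p)
    return w


def _readjust(w, i, p):
    # Thread i violates the constraint if w[i] / sum(w[i:]) > 1/p.
    # The test is cross-multiplied to avoid a division.
    if w[i] * p > sum(w[i:]):
        # Fix the lighter threads on the remaining processors first.
        _readjust(w, i + 1, p - 1)
        # Then give thread i exactly a 1/p share of what remains.
        w[i] = sum(w[i + 1:]) / (p - 1)


if __name__ == "__main__":
    print(readjust([10, 1], 2))          # Example 1, before thread 3 arrives
    print(readjust([2, 1, 1], 2))        # feasible 1:1:2 assignment
    print(readjust([2, 1], 2))           # one weight-1 thread has blocked
    print(readjust([100, 50, 1, 1], 3))  # two infeasible threads
```

For the weights of Example 1 on two processors this yields equal weights for both threads. The 1:1:2 assignment is left unchanged, and it is readjusted to 1:1 once one of the weight-1 threads blocks. Only threads that violate the constraint are modified, which is the optimality property described in the first bullet above.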


Given our weight readjustment algorithm, we now 
present an idealized algorithm for proportional-share 
scheduling in multiprocessor environments. 


2.2 Generalized Multiprocessor Sharing 


Consider a server with $p$ processors each with capacity $C$
that runs $t$ threads. Let the threads be assigned weights
$w_1, w_2, w_3, \ldots, w_t$. Let $\phi_i$ denote the instantaneous
weight of thread $i$ as computed by the readjustment
algorithm. At any instant, depending on whether the
thread satisfies or violates the feasibility constraint, $\phi_i$
is either the original weight $w_i$ or the readjusted weight.
From the definition of $\phi_i$, it follows that
$\frac{\phi_i}{\sum_j \phi_j} \le \frac{1}{p}$
at all times (our weight readjustment algorithm ensures
this property). Assume that threads can be scheduled
for infinitesimally small quanta and let $A_i(t_1, t_2)$ denote
the CPU service received by thread $i$ in the interval
$[t_1, t_2)$. Then the generalized multiprocessor sharing
(GMS) algorithm has the following property: for any
interval $[t_1, t_2)$, the amount of CPU service received by
thread $i$ satisfies

$$\frac{A_i(t_1, t_2)}{A_j(t_1, t_2)} \ge \frac{\phi_i}{\phi_j} \qquad (2)$$


provided that (i) thread $i$ is continuously runnable in the
entire interval, and (ii) both $\phi_i$ and $\phi_j$ remain fixed in
that interval. Note that the instantaneous weight $\phi_i$
remains fixed in an interval if the thread either satisfies the
feasibility constraint in the entire interval, or continuously
violates the constraint in the entire interval. It is
easy to show that Equation 2 implies proportionate
allocation of processor bandwidth.²

Intuitively, GMS is similar to a weighted round-robin 
algorithm in which threads are scheduled in round-robin 
order (p at a time); each thread is assigned an infinites- 
imally small CPU quantum and the number of quanta 
assigned to a thread is proportional to its weight. In 
practice, however, threads must be scheduled using fi- 
nite duration quanta so as to amortize context switch 
overheads. Consequently, in what follows, we present a 
CPU scheduling algorithm that employs finite duration 
quanta and is a practical approximation of GMS. 
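
As a small numeric illustration of the proportionate allocation that GMS implies, the sketch below computes the service each thread would receive over an interval when all threads are continuously runnable and their instantaneous weights stay fixed. In that case the total service equals $p \cdot C \cdot (t_2 - t_1)$, so each thread receives exactly its weighted fraction of it. The function name `gms_service` and the restriction to integer weights are assumptions of this example.

```python
from fractions import Fraction


def gms_service(phi, p, capacity, t1, t2):
    """Ideal GMS service per thread over [t1, t2).

    phi      -- instantaneous (already feasible) integer weights
    p        -- number of processors
    capacity -- capacity C of one processor
    """
    total_weight = sum(phi)
    total_bandwidth = p * capacity * (t2 - t1)
    return [Fraction(w, total_weight) * total_bandwidth for w in phi]


if __name__ == "__main__":
    # Weights 2:1:1 on two unit-capacity processors over 10 time units.
    print(gms_service([2, 1, 1], p=2, capacity=1, t1=0, t2=10))
```

With weights 2:1:1 on two processors, the first thread's fraction is exactly $1/p$, so it receives the equivalent of one whole processor for the interval, and the other two threads share the second processor equally.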




2.3 Surplus Fair Scheduling 


Consider a GMS-based CPU scheduling algorithm that 
schedules threads in terms of finite duration quanta. To 
clearly understand how such an algorithm works, we 
first present the intuition behind the algorithm and then 
provide precise details. Let us assume that thread $i$ is
assigned a weight $w_i$ and that the weight readjustment
algorithm is employed to ensure that weights are feasible
at all times. Let $\phi_i$ denote the instantaneous weight
of thread $i$. Let $A_i(t_1, t_2)$ denote the amount of CPU
service received by thread $i$ in the duration $[t_1, t_2)$, and
let $A_i^{GMS}(t_1, t_2)$ denote the amount of service that the
thread would have received if it were scheduled using
GMS. Then, the quantity

$$\alpha_i = A_i(t_1, t_2) - A_i^{GMS}(t_1, t_2) \qquad (3)$$
² This can be observed by summing Equation 2 over all runnable
threads $j$, which yields $A_i(t_1, t_2) \cdot \sum_j \phi_j \ge \phi_i \cdot \sum_j A_j(t_1, t_2)$.
Since $\sum_j A_j(t_1, t_2)$ is the total processor bandwidth allocated to all
threads in the interval, we can substitute it by the quantity $p \cdot C \cdot (t_2 - t_1)$.
Hence, we get $A_i(t_1, t_2) \ge \frac{\phi_i}{\sum_j \phi_j} \cdot p \cdot C \cdot (t_2 - t_1)$. Thus each
thread receives processor bandwidth in proportion to its instantaneous
weight $\phi_i$.

represents the extra service (i.e., surplus) received by
thread $i$ when compared to GMS. To closely emulate
GMS, a scheduling algorithm should schedule threads
such that the surplus $\alpha_i$ for each thread is as close to
zero as possible. Given a $p$-processor system, a simple
approach for doing so is to actually compute $\alpha_i$ for each
thread and schedule the $p$ threads with the least surplus
values. If the net surplus is negative, doing so allows a
thread to catch up with its allocation in GMS. Even when
the net surplus of a thread is positive, picking threads
with the least positive surplus values enables the algorithm
to ensure that the overall deviation from GMS is
as small as possible (picking a thread with a larger $\alpha_i$
would cause a larger deviation from GMS).

A scheduling algorithm that actually uses Equation 3
to compute surplus values is impractical since it requires
the scheduler to compute $A_i^{GMS}$ (which in turn
requires a simulation of GMS). Consequently, we derive
an approximation of Equation 3 that enables efficient
computation of the surplus values for each thread.
Let $S_1, S_2, \ldots, S_t$ denote the weighted CPU service
received by each thread so far. If thread $i$ runs in a quantum,
then $S_i$ is incremented as $S_i = S_i + \frac{q}{\phi_i}$, where
$q$ denotes the duration for which the thread ran in that
quantum. Since $S_i$ is the weighted CPU service received
by thread $i$, $\phi_i \cdot S_i$ represents the total service received
by thread $i$ so far. Let $v$ denote the minimum value of
$S_i$ over all runnable threads. Intuitively, $v$ represents the
processor allocation of the thread that has received the
minimum service so far. Then the surplus service received
by thread $i$ is defined to be

$$\alpha_i = \phi_i \cdot S_i - \phi_i \cdot v = \phi_i \cdot (S_i - v) \qquad (4)$$


The first term $\phi_i \cdot S_i$ approximates $A_i(0, t)$, which is
the service received by thread $i$ so far. The second term
$\phi_i \cdot v$ approximates the quantity $A_i^{GMS}(0, t)$ in Equation 3.
Thus, $\alpha_i$ measures the surplus service received by thread
$i$ when compared to the thread that has received the least
service so far (i.e., $v$). It follows from this definition
of $\alpha_i$ that $\alpha_i \ge 0$ for all runnable threads. Scheduling
a thread with the smallest value of $\alpha_i$ ensures that the
scheduler approximates GMS and each thread receives
processor bandwidth in proportion to its weight. Since a
thread is chosen based on its surplus value, we refer to
the algorithm as surplus fair scheduling (SFS).

Having provided the intuition for our algorithm, the 
precise SFS algorithm is as follows: 


• Each thread in the system is associated with a
weight $w_i$, a start tag $S_i$ and a finish tag $F_i$. Let $\phi_i$
denote the instantaneous weight of a thread as computed
by the readjustment algorithm. When a new
thread arrives, its start tag is initialized as $S_i = v$,
where $v$ is the virtual time of the system (defined
below). When a thread runs on a processor, its finish
tag at the end of the quantum is updated as

$$F_i = S_i + \frac{q}{\phi_i} \qquad (5)$$

where $q$ is the duration for which the thread ran in
that quantum and $\phi_i$ is its instantaneous weight at
the end of the quantum. Observe that $q$ can vary
depending on whether the thread utilizes its entire
allocated quantum or relinquishes the processor before
the quantum ends due to a blocking event. The
start tag of a runnable thread is computed as

$$S_i = \begin{cases} \max(F_i, v) & \text{if the thread just woke up} \\ F_i & \text{if the thread is continuously runnable} \end{cases} \qquad (6)$$


• Initially, the virtual time of the system is zero. At
any instant, the virtual time is defined to be the minimum
of the start tags over all runnable threads.
The virtual time remains unchanged if all processors
are idle and is set to the finish tag of the thread
that ran last.


• At each scheduling instance, SFS computes the surplus
values of all runnable threads as $\alpha_i = \phi_i \cdot (S_i - v)$
and schedules the thread with the least $\alpha_i$; ties
are broken arbitrarily.


Our surplus fair scheduling algorithm has the following
salient features. First, like most GPS-based algorithms,
SFS is work-conserving in nature: the algorithm ensures
that a processor will not idle so long as there are
runnable threads in the system. Second, since the surplus
$\alpha_i$ of a thread depends only on its start tag and not
the finish tag, SFS does not require the quantum length
to be known at the time of scheduling (the quantum duration
$q$ is required to compute the finish tag only after
the quantum ends). This is a desirable feature since the
duration of a quantum can vary if a thread blocks before
it is preempted. Third, SFS ensures that blocked threads
do not accumulate credit for the processor shares they
do not utilize while sleeping. This is ensured by setting
the start tag of a newly woken-up thread to at least the
virtual time (this prevents a thread from accumulating
credit by sleeping for a long duration and then starving
other threads upon waking up). Finally, from the definition
of $\alpha_i$ and the virtual time, it follows that at any
instant there is always at least one thread with $\alpha_i = 0$ (this
is the thread with the minimum start tag, i.e., $S_i = v$,
which also has the least surplus value). Since the thread
with the minimum surplus value is also the one with
the minimum start tag, surplus fair scheduling reduces
to start-time fair queuing (SFQ) [9] in a uniprocessor
system. Thus, SFS can be viewed as a generalization of
SFQ for multiprocessor environments. We experimentally
demonstrate in Section 4.3 that SFS addresses the
problem of proportionate allocation in the presence of
frequent arrivals and departures described in Example 2
of Section 1.2.
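
The tag updates and the surplus computation described above can be sketched in a few lines. This is an illustrative model only, not the Linux implementation: the class `Thread` and the functions `virtual_time`, `surplus`, `pick`, `run_round` and `replay_example1` are hypothetical names, quanta are assumed to be synchronized so that the $p$ threads with the least surplus run in each round, ties are broken by thread id, and the readjusted weights are supplied by hand instead of being computed by the readjustment algorithm.

```python
from fractions import Fraction


class Thread:
    def __init__(self, tid, phi, start=0):
        self.tid = tid
        self.phi = Fraction(phi)        # instantaneous weight, assumed readjusted
        self.start = Fraction(start)    # start tag S_i
        self.finish = Fraction(start)   # finish tag F_i
        self.service = 0                # CPU time received, for inspection only


def virtual_time(runnable):
    """v: the minimum start tag over all runnable threads."""
    return min(t.start for t in runnable)


def surplus(thread, v):
    """alpha_i = phi_i * (S_i - v)."""
    return thread.phi * (thread.start - v)


def pick(runnable, p):
    """The p runnable threads with the least surplus (ties by thread id)."""
    v = virtual_time(runnable)
    return sorted(runnable, key=lambda t: (surplus(t, v), t.tid))[:p]


def run_round(runnable, p, q=1):
    for t in pick(runnable, p):
        t.finish = t.start + Fraction(q) / t.phi   # Equation 5
        t.start = t.finish                         # Equation 6, runnable case
        t.service += q


def replay_example1(rounds=1000):
    """Example 1 under SFS; returns service of threads 1 and 3 after the arrival."""
    t1, t2 = Thread(1, 1), Thread(2, 1)   # weights 1:10 readjust to 1:1 for p = 2
    runnable = [t1, t2]
    for _ in range(rounds):
        run_round(runnable, p=2)
    t2.phi = Fraction(2)                  # weights 1:10:1 readjust to 1:2:1
    t3 = Thread(3, 1, start=virtual_time(runnable))
    runnable.append(t3)
    before = t1.service
    for _ in range(rounds):
        run_round(runnable, p=2)
    return t1.service - before, t3.service


if __name__ == "__main__":
    print(replay_example1())
```

In this model thread 1 keeps running after thread 3 arrives: over the 1000 rounds that follow the arrival, threads 1 and 3 each receive roughly half of the rounds, while thread 2 occupies the other processor. The sketch omits the woken-up case of Equation 6 and the per-processor invocation of the scheduler.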


2.4 Fair Allocation versus Processor 
Affinities 

SFS as defined in the previous section achieves pure fair- 
share allocation but does not take processor affinities 
[25] into account while making scheduling decisions. 
Scheduling a thread on the same processor enables it to 
benefit from data cached from previous scheduling in- 
stances and improves the effectiveness of a processor 
cache. SFS can be modified to account for processor 
affinities as follows. Instead of scheduling the thread 
with the least surplus value on a processor, SFS can 
examine the first $\beta$ threads with the least surplus
values and pick one which was previously scheduled
on that processor. If no such thread exists, then
the scheduler simply picks the thread with the least surplus
value for execution. The quantity $\beta$ is a tunable parameter
and is referred to as the processor affinity bias.
Using $\beta = 1$ reduces SFS to pure fair-share scheduling; a
large value of $\beta$ increases the probability of finding a
thread with an affinity for a particular processor. Observe
that processor-affinity based scheduling and fair-share
scheduling can be conflicting goals. Using a large
processor affinity bias can cause SFS to deviate from 
GMS-based fair allocation but allows the scheduler to 
improve performance by exploiting cache locality. In 
contrast, a small value of the bias enables SFS to provide 
better fairness guarantees but can degrade cache perfor- 
mance. 
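
The affinity rule can be expressed as a small selection function. The sketch below is one possible reading of the description above, not the kernel code: the function name `pick_with_affinity` and the tuple layout `(surplus, thread_id, last_cpu)` are assumptions made for illustration.

```python
def pick_with_affinity(candidates, cpu, beta):
    """Choose a thread for `cpu` using a processor affinity bias `beta`.

    candidates -- list of (surplus, thread_id, last_cpu) tuples
    cpu        -- id of the processor making the scheduling decision
    beta       -- number of least-surplus threads to examine (beta >= 1)
    """
    ordered = sorted(candidates, key=lambda c: (c[0], c[1]))
    # Prefer a thread among the first beta that last ran on this processor.
    for _, thread_id, last_cpu in ordered[:beta]:
        if last_cpu == cpu:
            return thread_id
    # Otherwise fall back to the thread with the least surplus.
    return ordered[0][1]


if __name__ == "__main__":
    runnable = [(0, 1, 1), (1, 2, 0), (2, 3, 0)]
    print(pick_with_affinity(runnable, cpu=0, beta=1))
    print(pick_with_affinity(runnable, cpu=0, beta=2))
```

With `beta=1` only the least-surplus thread is examined, so the choice is the pure fair-share one (thread 1 here). With `beta=2` the scheduler on processor 0 prefers thread 2, which last ran there, at the cost of a small deviation from the fair order.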


3 Implementation Considerations 


We have implemented surplus fair scheduling in the 
Linux kernel and have made the source code publicly 
available to the research community.³ The entire implementation
effort took less than three weeks and was
around 1500 lines of code. In the rest of this section, 
we present the details of our kernel implementation and 
explain some of our key design decisions. 


3.1 SFS Data Structures and Implementa- 
tion 


The implementation of surplus fair scheduling was done 
in version 2.2.14 of the Linux kernel. Our imple- 
mentation replaces the standard time sharing sched- 
uler in Linux; the modified kernel schedules all 


³ The source code for our implementation is available from 
http://www.cs.umass.edu/lass/software/gms. 


threads/processes using SFS. Each thread in the system 
is assigned a default weight of 1; the weight assigned
to a thread can be modified (or queried) using two new
system calls, setweight and getweight. The parameters
expected by these system calls are similar to the
setpriority and getpriority system calls em- 
ployed by the Linux time sharing scheduler. SFS allows 
the weight assigned to a thread to be modified at any 
time (just as the Linux time sharing scheduler allows the 
priority of a thread to be changed on-the-fly). 

Our implementation of SFS maintains three queues. 
The first queue consists of all runnable threads in de- 
scending order of their weights. The other two queues 
consist of all runnable threads in increasing order of 
start tags and surplus values, respectively. The first 
queue is employed by the readjustment algorithm to de- 
termine the feasibility of the assigned weights (recall 
from Section 2.1 that maintaining a list of threads sorted 
by their weights enables the weight readjustment algo- 
rithm to be implemented efficiently). The second queue 
is employed by the scheduler to compute the virtual 
time; since the queue is sorted on start tags, the virtual 
time at any instant is simply the start tag of the thread 
at the head of the queue. The third queue is used to 
determine which thread to schedule next—maintaining 
threads sorted by their surplus values enables the sched- 
uler to make scheduling decisions efficiently. 
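
The role of the three queues can be illustrated with a user-space sketch. It is not the kernel data structure: the paper's queues are doubly linked lists, whereas this example uses sorted Python lists with the standard `bisect` module, and the class name `RunQueues` and its methods are hypothetical. With Python lists, `bisect.insort` locates the insert position by binary search, but the insertion itself still shifts elements.

```python
import bisect


class RunQueues:
    """Three sorted views of the runnable threads, keyed by thread id."""

    def __init__(self):
        self.by_weight = []    # (-weight, tid): descending order of weight
        self.by_start = []     # (start_tag, tid): increasing start tag
        self.by_surplus = []   # (surplus, tid): increasing surplus

    def insert(self, tid, weight, start_tag, surplus):
        bisect.insort(self.by_weight, (-weight, tid))
        bisect.insort(self.by_start, (start_tag, tid))
        bisect.insort(self.by_surplus, (surplus, tid))

    def virtual_time(self):
        """Start tag of the thread at the head of the start-tag queue."""
        return self.by_start[0][0]

    def next_thread(self):
        """Thread with the minimum surplus."""
        return self.by_surplus[0][1]

    def readjust_candidates(self, p):
        """The (p - 1) heaviest threads, the only ones that can be infeasible."""
        return [tid for _, tid in self.by_weight[:p - 1]]


if __name__ == "__main__":
    queues = RunQueues()
    queues.insert(tid=1, weight=1, start_tag=5, surplus=0)
    queues.insert(tid=2, weight=10, start_tag=3, surplus=2)
    queues.insert(tid=3, weight=4, start_tag=7, surplus=1)
    print(queues.virtual_time(), queues.next_thread(), queues.readjust_candidates(3))
```

Each of the three operations the scheduler needs (finding the virtual time, finding the next thread to run, and finding the threads the readjustment algorithm must examine) reads only the head of one queue, which is why keeping the queues sorted pays off.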

Given these data structures, the actual scheduling is 
performed as follows. Whenever a quantum expires or 
one of the currently running threads blocks, the Linux 
kernel invokes the SFS scheduler. The SFS scheduler 
first updates the finish tag of the thread relinquishing the 
processor and then computes its start tag (if the thread 
is still runnable). The scheduler then computes the new 
virtual time; if the virtual time changes from the previous
scheduling instance, then the scheduler must update the
surplus values of all runnable threads (since $\alpha_i$
is a function of v) and re-sort the queue. The scheduler
then picks the thread with the minimum surplus and
schedules it for execution. Note that since a running 
thread may not utilize its entire allocated quantum due 
to blocking events, quanta on different processors are
not synchronized; hence, each processor independently 
invokes the SFS scheduler when its currently running 
thread blocks or is preempted. Finally, the readjust- 
ment algorithm is invoked every time the set of runnable 
threads changes (i.e., after each arrival, departure, block- 
ing event or wakeup event), or if the user changes the 
weight of a thread. 
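The scheduling step described above can be sketched in a few lines of Python. This is a minimal single-processor illustration, not the kernel code: the `Thread` class, `charge`, `pick_next` and `simulate` are hypothetical names, and the tag-update rule (finish tag = start tag + q / weight, with the next start tag equal to the finish tag for a thread that stays runnable) is an assumed SFQ-style rule. The surplus is computed as weight times (start tag minus virtual time), with the virtual time taken as the minimum start tag.

```python
from dataclasses import dataclass


@dataclass
class Thread:
    name: str
    weight: float        # phi_i
    start: float = 0.0   # start tag S_i
    finish: float = 0.0  # finish tag F_i


def charge(thread, q):
    """Update the tags after `thread` ran for q time units (assumed rule)."""
    thread.finish = thread.start + q / thread.weight
    thread.start = thread.finish  # the thread remains runnable


def pick_next(runnable):
    """Return the runnable thread with the minimum surplus."""
    v = min(t.start for t in runnable)  # virtual time = minimum start tag
    return min(runnable, key=lambda t: t.weight * (t.start - v))


def simulate(threads, quanta, q=1.0):
    """Make `quanta` scheduling decisions and count quanta per thread."""
    counts = {t.name: 0 for t in threads}
    for _ in range(quanta):
        t = pick_next(threads)
        charge(t, q)
        counts[t.name] += 1
    return counts
```

With two always-runnable threads of weights 1 and 2, `simulate([Thread("A", 1.0), Thread("B", 2.0)], 300)` gives A 100 quanta and B 200, that is, shares in proportion to the weights. The sketch recomputes every surplus on each decision instead of maintaining sorted queues.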


3.2 Implementation Complexity and Optimizations


The implementation complexity of the SFS algorithm is
as follows:

• New arrival or a wakeup event: The newly arrived/woken
up thread must be inserted at the appropriate position in
the three run queues. Since the queues are in sorted order,
using a linear search for insertions takes O(t), where t is
the number of runnable threads. The complexity can be
further reduced to O(log t) if binary search is used to
determine the insert position. The readjustment algorithm
is invoked after the insertion, which has a complexity of
O(p). Hence, the total complexity is O(t + p).


• Departure or a blocking event: The terminated/blocked
thread must be deleted from the run queue, which is O(1)
since our queues are doubly linked lists. The readjustment
algorithm is then invoked for the new run queue, which
takes O(p). Hence, the total complexity is O(p).


• Scheduling decisions: The scheduler first updates the
finish and start tags of the thread relinquishing the
processor and computes the new virtual time, all of which
are constant time operations. If the virtual time is
unchanged, the scheduler only needs to pick the thread with
minimum surplus (which takes O(1) time). If the virtual
time increases from the previous scheduling instance, then
the scheduler must first update the surplus values of all
runnable threads and re-sort the queue. Sorting is an
O(t log t) operation, while updating surplus values takes
O(t). Hence, the total complexity is O(t log t). The run
time performance, in the average case, can be improved by
observing the following. Since the queue was in sorted order
prior to the updates, in practice, the queue remains mostly
in sorted order after the new surplus values are computed.
Hence, we employ insertion sort to re-sort the queue, since
it has good run time performance on mostly-sorted lists.
Moreover, updates and sorting are required only when the
virtual time changes. The virtual time is defined to be the
minimum start tag in the system, and hence, in a p-processor
system, typically only one of the p currently running
threads has this start tag. Consequently, on average, the
virtual time changes only once every p scheduling instances,
which amortizes the scheduling overhead over a larger number
of scheduling instances.


• Synchronization issues: Synchronization overheads can
become an issue in SMP servers if the scheduling algorithm
imposes a large overhead. Despite its O(t log t) overhead,
SFS can be implemented efficiently for the following
reasons. First, we have developed a scheduling heuristic
(described next) that reduces the scheduling overhead to a
constant. Second, although the readjustment algorithm needs
to lock the run queue while examining the feasibility
constraint for runnable threads, as explained earlier, these
checks can be done efficiently in O(p) time (independent of
the number of threads in the system). Finally, the
granularity of locks required by SFS is identical to that in
the Linux SMP scheduler. In fact, our implementation reuses
that portion of the code.
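The choice of insertion sort for re-sorting a mostly-sorted queue can be illustrated with a short Python sketch. The function name and the shift counter are illustrative additions; the kernel implementation operates on doubly linked lists rather than Python lists. The counter shows that the work done is proportional to how far elements are out of place.

```python
def insertion_sort(items, key=lambda x: x):
    """Sort `items` in place by `key`; return the number of element shifts."""
    shifts = 0
    for i in range(1, len(items)):
        cur = items[i]
        j = i - 1
        # Move larger elements one slot right until cur's position is found.
        while j >= 0 and key(items[j]) > key(cur):
            items[j + 1] = items[j]
            j -= 1
            shifts += 1
        items[j + 1] = cur
    return shifts
```

On the nearly sorted list `[1, 2, 4, 3, 5, 6]` a single shift is needed, whereas the fully reversed list `[6, 5, 4, 3, 2, 1]` needs 15, the quadratic worst case for six elements.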


Since the scheduling overhead of SFS grows with the 
number of runnable threads, we have developed a heuris- 
tic to limit the scheduling overhead when the number of 
runnable threads becomes large. Our heuristic is based 
on the observation that $\alpha_i = \phi_i \cdot (S_i - v)$ and hence,
the thread with the minimum surplus typically has either 
a small weight, a small start tag, or a small surplus in 
the previous scheduling instance. Consequently, exam- 
ining a few threads with small start tags, small weights, 
or small prior surplus values, computing their new sur- 
pluses and choosing the thread with minimum surplus 
is a good heuristic in practice. Since our implementation
already maintains three queues sorted by $\phi_i$, $S_i$
and $\alpha_i$, this can be trivially done by examining the first
few threads in each queue, computing their new sur- 
plus values and picking the thread with the least surplus. 
This obviates the need to update the surpluses and to 
re-sort every time the virtual time changes; the sched- 
uler needs to do so only every so often and can use the 
heuristic between updates (infrequent updates and sort- 
ing are still required to maintain a high accuracy of the 
heuristic). Hence, the scheduling overhead reduces to 
a constant and becomes independent of the number of 
runnable threads in the system (updates to $\alpha_i$ and sorting
continue to be O(t log t), but this overhead is amortized 
over a large number of scheduling instances). Moreover, 
since the heuristic examines multiple runnable threads, 
it can be easily combined with the technique proposed in 
Section 2.4 to account for processor affinities. We con- 
ducted several simulation experiments to determine the 
efficacy of this heuristic. Figure 3 plots the percentage 
of the time our heuristic successfully picks the thread 
with the minimum surplus (we omit detailed results due 
to space constraints). The figure shows that, in a quad- 
processor system, examining the first 20 threads in each 
queue provides sufficient accuracy (> 99%) even when 
the number of runnable threads is as large as 5000 (the 
actual number of threads in the system is typically much 
larger). 
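A rough Python sketch of the heuristic follows. It is an illustration under stated assumptions, not the kernel code: threads are plain dictionaries with hypothetical keys `weight`, `start` and `surplus` (the stale value from the last full update), and `heapq.nsmallest` stands in for reading the first k entries of each pre-sorted queue.

```python
import heapq


def heuristic_pick(threads, v, k):
    """Pick a thread by examining only k candidates per ordering.

    threads: list of dicts with keys 'weight', 'start', 'surplus'.
    v: current virtual time.  k: number of threads examined per queue.
    """
    candidates = {}
    for field in ("weight", "start", "surplus"):
        for t in heapq.nsmallest(k, threads, key=lambda t: t[field]):
            candidates[id(t)] = t  # de-duplicate threads seen in several queues
    # Recompute the surplus only for the candidates.
    for t in candidates.values():
        t["surplus"] = t["weight"] * (t["start"] - v)
    return min(candidates.values(), key=lambda t: t["surplus"])
```

At most 3k surpluses are recomputed per decision, independent of the number of runnable threads. In a real implementation the candidates would come from queues that are already sorted, so selecting them would cost O(k) rather than a scan of all threads.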

As a final caveat, the Linux kernel uses only integer
variables for efficiency reasons and avoids using floating
point variables as a data type. Since the computation of
start tags, finish tags and surplus values involves floating
point operations, we simulate floating point variables using
integer variables. To do so we scale each floating point
operation in SFS by a constant factor. Employing a scaling
factor of $10^n$ for each floating point operation enables us
to capture n places beyond the decimal point in an integer
variable (e.g., the finish tag is computed as
$F_i = S_i + \frac{q \cdot 10^n}{\phi_i}$). The scaling factor
is a compile time parameter and can be chosen based on the
desired accuracy; we found a scaling factor of $10^4$ to be
adequate for most purposes. Observe that a large scaling
factor can hasten the wraparound in the start and finish
tags of long running threads; we deal with wraparound by
adjusting all start and finish tags with respect to the
minimum start tag in the system and resetting the virtual
time.

Figure 3: Efficacy of the scheduling heuristic: the figure
plots the percentage of the time the heuristic successfully
picks the thread with the least surplus for varying run
queue lengths and varying number of threads examined.
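The integer scaling described above can be sketched as follows. This is an illustration only: the names `SCALE_DIGITS`, `finish_tag` and `rebase` are hypothetical, the finish tag rule (start tag plus scaled quantum divided by weight) is assumed, and Python integers do not overflow, so the wraparound handling is shown purely for its arithmetic.

```python
SCALE_DIGITS = 4            # n: decimal places kept (assumed compile-time choice)
SCALE = 10 ** SCALE_DIGITS  # scaling factor 10^n


def finish_tag(start_scaled, q, weight):
    """Integer-only finish tag: start tag plus scaled quantum over weight."""
    return start_scaled + (q * SCALE) // weight


def rebase(tags):
    """Shift (start, finish) pairs so that the minimum start tag becomes 0."""
    base = min(start for start, _ in tags)
    return [(start - base, finish - base) for start, finish in tags]
```

For a quantum of 200 and a weight of 3, `finish_tag(0, 200, 3)` yields 666666, which represents 66.6666 with four decimal places. Rebasing preserves all differences between tags, so the relative order of threads is unchanged.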


4 Experimental Evaluation 


In this section, we experimentally evaluate the surplus 
fair scheduling algorithm and demonstrate its efficacy. 
We conducted several experiments to (i) examine the 
benefits of the readjustment algorithm, (ii) demonstrate 
proportionate allocation of processor bandwidth in SFS, 
and (iii) measure the scheduling overheads imposed by 
SFS. We used SFQ and the Linux time sharing scheduler 
as the baseline for our comparisons. In what follows, we 
first describe the test-bed for our experiments and then 
present the results of our experimental evaluation. 


4.1. Experimental Setup 


The test-bed for our experiments consisted of a 500 
MHz Pentium III-based dual-processor PC with 128 MB 
RAM, 13GB SCSI disk, and a 100 Mb/s 3Com Ethernet card
(model 3c595). The PC ran the default installation of
Red Hat Linux 6.0. We used version 2.2.14 of
the Linux kernel for our experiments; depending on the 
experiment, the kernel employed either SFS, SFQ or the 
time sharing scheduler to schedule threads. In each case, 
we used a quantum duration of 200 ms, which is the de- 
fault quantum duration employed by the Linux kernel. 
The Linux kernel (and hence, our SFS scheduler) can be 
configured to employ finer-grain quanta; however, we do 
not examine the implications of doing so in this paper. 
All experiments were run when the system was lightly 
loaded. Note that due to resource constraints, our exper- 
iments were run on a system with only two processors; 
we have verified the efficacy of SFS on a larger number 
of processors via simulations (we omit these results due 
to space constraints). 

The workload for our experiments consisted of a 
combination of real-world applications, benchmarks, 
and sample applications that we wrote to demonstrate 
specific features. These applications include: (i) Inf, 
a compute-intensive application that performs computations
in an infinite loop; (ii) Interact, an I/O bound interactive
application; (iii) thttpd, a single-threaded event-based web
server; (iv) mpeg_play, the Berkeley software MPEG-1 decoder;
(v) gcc, the GNU C compiler; (vi) disksim, a publicly-available
disk simulator; (vii) dhrystone, a compute-intensive benchmark
for measuring integer performance; and (viii) lmbench, a
benchmark that measures various aspects of operating system
performance. Next, we describe the experimental results
obtained using these applications and benchmarks.


4.2 Impact of the Weight Readjustment Algorithm


To show that the weight readjustment algorithm can be
combined with existing GPS-based scheduling algorithms to
reduce the unfairness in their allocations, we conducted the
following experiment. At t=0, we started two Inf applications
($T_1$ and $T_2$) with weights 1:10. At t=15s, we started a
third Inf application ($T_3$) with a weight of 1. Task $T_2$
was then stopped at t=30s. We measured the processor shares
received by the three applications (in terms of number of
loops executed) when scheduled using SFQ; we then repeated
the experiment with SFQ coupled with the weight readjustment
algorithm. Observe that this experimental scenario
corresponds to the infeasible weights problem described in
Example 1 of Section 1.2. As expected, SFQ is unable to
distinguish between feasible and infeasible weight
assignments, causing task $T_1$ to starve upon the arrival of
task $T_3$ at t=15s (see Figure 4(a)). In contrast, when
coupled with the readjustment algorithm, SFQ ensures that all
tasks receive bandwidth in proportion to their instantaneous
weights (1:1 from t=0 through t=15, 1:2:1 from t=15 through
t=30, and 1:1 from then on). See Figure 4(b). This
demonstrates that the weight readjustment algorithm enables
a GPS-based scheduler such as SFQ to reduce the unfairness
in its allocations in multiprocessor environments.

Figure 4: Impact of the weight readjustment algorithm [...]
starvation and reduces the unfairness in its allocations.
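The instantaneous weights quoted in this experiment can be reproduced with a small sketch of a readjustment procedure. The code is an illustration, not the authors' implementation: it assumes the feasibility constraint that no thread's weight may exceed a 1/p fraction of the total on p processors, caps each violating thread (largest weight first) to the largest feasible value, and treats the processor count as min(p, number of threads). The function name `readjust` and the use of `Fraction` for exact arithmetic are choices made here.

```python
from fractions import Fraction


def readjust(weights, p):
    """Return feasible weights (same order as the input) for p processors."""
    w = [Fraction(x) for x in weights]
    order = sorted(range(len(w)), key=lambda i: w[i], reverse=True)
    procs = min(p, len(w))   # a thread can use at most one processor
    remaining = sum(w)       # total weight of threads not yet set aside
    capped = 0
    # Pass 1: find the heaviest threads that violate w_i / total <= 1 / procs.
    for idx in order:
        if procs > 1 and w[idx] * procs > remaining:
            remaining -= w[idx]
            procs -= 1
            capped += 1
        else:
            break
    # Pass 2: give each violating thread exactly one processor's share,
    # working back from the lightest violator to the heaviest.
    for idx in reversed(order[:capped]):
        w[idx] = remaining / procs
        remaining += w[idx]
        procs += 1
    return w
```

On two processors, `readjust([1, 10], 2)` returns weights in the ratio 1:1 and `readjust([1, 10, 1], 2)` returns 1:2:1, matching the ratios reported for the intervals before and after the third task arrives. Feasible assignments such as `[1, 1, 1]` are returned unchanged.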


4.3. Comparing SFQ and SFS 


In this section, we demonstrate that even with the weight 
readjustment algorithm, SFQ can show unfairness in 
multiprocessor environments, especially in the presence 
of frequent arrivals and departures (as discussed in Ex- 
ample 2 of Section 1.2). We also show that SFS does 
not suffer from this limitation. To demonstrate this
behavior, we started an Inf application ($T_1$) with a weight
of 20, and 20 Inf applications (collectively referred to as
$T_{2-21}$), each with weight of 1. To simulate frequent
arrivals and departures, we then introduced a sequence of
short Inf tasks ($T_{short}$) into the system. Each of these
short tasks was assigned a weight of 5 and ran for 300 ms;
each short task was introduced only after the previous one
finished. Observe that the weight assignment is feasible at
all times, and the weight readjustment algorithm never
modifies any weights. We measured the processor share
received by each application (in terms of the cumulative
number of loops executed). Since the weights of $T_1$,
$T_{2-21}$ and $T_{short}$ are in the ratio 20:20:5, we expect
$T_1$ and $T_{2-21}$ to receive an equal share of the total
bandwidth and this share to be four times the bandwidth
received by $T_{short}$. However, as shown in Figure 5(a),
SFQ is unable to allocate bandwidth in these proportions (in
fact, each set of tasks receives approximately an equal
share of the bandwidth). SFS, on the other hand, is able to
allocate bandwidth approximately in the requested proportion
of 4:4:1 (see Figure 5(b)).

The primary reason for this behavior is that SFQ schedules
threads in “spurts”: threads with larger weights (and hence,
smaller start tags) run continuously for some number of
quanta, then threads with smaller weights run for a few
quanta and the cycle repeats. In the presence of frequent
arrivals and departures, scheduling in such “spurts” allows
tasks with higher weights ($T_1$ and $T_{short}$ in our
experiment) to run almost continuously on the two
processors; $T_{2-21}$ get to run infrequently. Thus, each
$T_{short}$ task gets as much processor share as the higher
weight task $T_1$; since each $T_{short}$ task is short lived,
SFQ is unable to account for the bandwidth allocated to the
previous task when the next one arrives. SFS, on the other
hand, schedules each application based on its surplus.
Consequently, no task can run continuously and accumulate a
large surplus without allowing other tasks to run first;
this finer interleaving of tasks enables SFS to achieve
proportionate allocation even with frequent arrivals and
departures.


4.4 Proportionate Allocation and Application Isolation in SFS


Next, we demonstrate proportionate allocation and ap- 
plication isolation of tasks in SFS. To demonstrate pro- 
portionate allocation, we ran 20 background dhrystone 
processes, each with a weight of 1. We then ran two 
thttpd web servers and assigned them different weights 
(1:1, 1:2, 1:4 and 1:7). A large number of requests were 
then sent to each web server. In each case, we mea- 
sured the average processor bandwidth allocated to each 
web server (the background dhrystone processes were 
necessary to ensure that all weights were feasible at all 
times; without these processes, no weight assignment 
other than 1:1 would be feasible in a dual-processor sys- 
tem). As shown in Figure 6(a), the processor bandwidth 
allocated by SFS to each web server is in proportion to 
its weight. 

To show that SFS can isolate applications from one another,
we ran the mpeg_play software decoder in the presence of a
background compilation workload. The decoder was given a
large weight and used to decode a 5 minute long MPEG-1 clip
that had an average bit rate of 1.49 Mb/s. Simultaneously,
we ran a varying number of gcc compile jobs, each with a
weight of 1. The scenario represents video playback in the
presence of background compilations; running multiple
compilations simultaneously corresponds to a parallel make
job (i.e., make -j) that spawns multiple independent
compilations in parallel. Observe that assigning a large
weight to the decoder ensures that the readjustment
algorithm will effectively assign it the bandwidth of one
processor, and the compilation jobs share the bandwidth of
the other processor.

Figure 5: The Short Jobs Problem. Frequent arrivals and
departures in multiprocessor environments prevent SFQ from
allocating bandwidth in the requested proportions. SFS does
not have this drawback.

Figure 6: Proportionate allocation and application isolation
in SFS. Figure (a) shows that SFS allocates bandwidth in the
requested proportions. Figure (b) shows that SFS can isolate
a software video decoder from background compilations.
Figure (c) shows that SFS provides interactive performance
comparable to time sharing.


We varied the compilation workload and measured the frame
rate achieved by the software decoder. We then repeated the
experiment with the Linux time sharing scheduler. As shown
in Figure 6(b), SFS is able to isolate the video decoder
from the compilation workload, whereas the Linux time
sharing scheduler causes the processor share of the decoder
to drop with increasing load. We hypothesize that the slight
decrease in the frame rate in SFS is caused by the
increasing number of intermediate files created and written
by the gcc compiler, which interferes with the reading of
the MPEG-1 file by the decoder.


Our final experiment consisted of an I/O-bound inter- 
active application Interact that ran in the presence of a 
background simulation workload (represented by some 
number of disksim processes). Each application was as- 
signed a weight of 1, and we measured the response time 
of Interact for different background loads. As shown in 
Figure 6(c), even in the presence of a compute-intensive 
workload, SFS provides response times that are compa- 
rable to the time sharing scheduler (which is designed to 
give higher priority to I/O-bound applications).

[Table 1 rows: Context switch (2 proc/0KB); Context switch
(8 proc/16KB); Context switch (16 proc/64KB)]

Table 1: Scheduling overheads reported by lmbench


4.5 Benchmarking SFS: Scheduling Overheads


We used lmbench, a publicly available operating system
benchmark, to measure the overheads imposed by the SFS
scheduler. We ran lmbench on a lightly loaded machine with
SFS and repeated the experiment with the Linux time sharing
scheduler. In each case, we averaged the statistics reported
by lmbench over several runs to reduce experimental error.
Note that the SFS code is untuned, while the time sharing
scheduler has benefited from careful tuning by the Linux
kernel developers. Table 1 summarizes our results (we report
only those lmbench statistics that are relevant to the CPU
scheduler). As shown in Table 1, the overhead of creating
processes (measured using the fork and exec system calls) is
comparable in both schedulers. The context switch overhead,
however, increases from 1 µs to 4 µs for two 0KB processes
(the size associated with a process is the size of the array
manipulated by each process and has implications on
processor cache performance [13]). Although the overhead
imposed by SFS is higher, it is still considerably smaller
than the 200 ms quantum duration employed by Linux. The
context switch overheads increase in both schedulers with
increasing number of processes and increasing process sizes.
SFS continues to have a slightly higher overhead, but the
percentage difference between the two schedulers decreases
with increasing process sizes (since the restoration of the
cache state becomes the dominating factor in context
switches).
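To put the two-process figure in perspective, the following back-of-the-envelope calculation relates the 4 µs context switch cost under SFS to the 200 ms quantum. It assumes, purely for illustration, one context switch per quantum:

```latex
\[
\frac{4\,\mu\text{s}}{200\,\text{ms}}
  = \frac{4 \times 10^{-6}\,\text{s}}{2 \times 10^{-1}\,\text{s}}
  = 2 \times 10^{-5}
  = 0.002\%
\]
```

Under that assumption the switch cost is a negligible fraction of each quantum; threads that block early and switch more often would see a proportionally larger, though still small, fraction.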
Figure 7 plots the context switch overhead imposed by the
two schedulers for varying number of 0 KB processes (the
array size manipulated by each process was set to zero to
eliminate caching overheads from the context switch times).
As shown in the figure, the context switch overhead
increases sharply as the number of processes increases from
0 to 5, and then grows with the number of processes. The
initial increase is due to the increased book-keeping
overheads incurred with a larger number of runnable
processes (scheduling decisions are trivial when there is
only one runnable process and require minimal updates to
kernel data structures). The increase in scheduling overhead
thereafter is consistent with the complexity of SFS reported
in Section 3.2 (the scheduling heuristic presented in that
section was not used in this experiment). Interestingly, the
Linux time sharing scheduler also imposes an overhead that
grows with the number of processes.

[Plot “Scheduling overhead imposed by 0KB processes”:
context switch time (microsec) versus number of processes]

Figure 7: Scheduling overheads reported by lmbench
with varying number of processes.


5 Limitations and Directions for Future Work


Whereas surplus fair scheduling achieves proportionate 
allocation of bandwidth in multiprocessor environments, 
it has certain limitations. In what follows, we discuss 
some of the limitations of SFS and opportunities for fu- 
ture work. 

In SFS, the QoS requirements of an application are 
distilled to a single dimension, namely its rate (which 
is specified using a weight). That is, SFS is a pure 
proportional-share CPU scheduler. Applications can 
have requirements along other dimensions. For instance, 
interactive applications tend to be more latency-sensitive 
than batched applications, or a certain application may 
need to have higher priority than other applications. 
Recent research has extended GPS-based proportional- 
share schedulers to account for these dimensions. For 
instance, SMART [16] enhances a GPS-based sched- 
uler with priorities, while BVT [7] extends a GPS-based 
scheduler to handle latency requirements of threads. We 
plan to explore similar extensions for GMS-based sched- 
ulers such as SFS as part of our future work. 

GPS-based schedulers such as SFQ can perform hi- 
erarchical scheduling. This allows threads to be ag- 
gregated into classes and CPU shares to be allocated 
on a per-class basis. Consequently, hierarchical sched- 
ulers can handle resource principals (e.g., processes) 
consisting of multiple threads. Many hierarchical sched- 
ulers also support class-specific schedulers, in which the 
bandwidth allocated to a class is distributed among indi- 
vidual threads using a class-specific scheduling policy. 
SFS is a single-level scheduler and can only handle
resource principals with a single thread. We are currently
enhancing SFS to overcome both limitations. To handle
resource principals with multiple threads, we are
generalizing our weight readjustment algorithm. Specifically,
a resource principal with $\tau$ threads can be
simultaneously scheduled on $\tau$ processors. The
feasibility constraint for such a resource principal is
specified as

$$\frac{\phi_i}{\sum_j \phi_j} \le \frac{\tau}{p}$$

We are modifying the weight readjustment algorithm 
to incorporate this constraint. To support hierarchical 
scheduling, we are modifying SFS to allow independent 
resource principals to be grouped into classes in a hier- 
archical manner. Assuming that these groups are speci- 
fied in the form of a tree, our enhanced algorithm allows 
weights to be specified for each node (sub-class) in the 
tree. Our weight readjustment algorithm then ensures 
feasibility of the weights assigned to each node based 
on the number of runnable threads in that sub-tree. 
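As an illustration of the generalized constraint, the sketch below checks whether a weight assignment is feasible when each resource principal may have several threads. It assumes the constraint takes the form "a principal's share of the total weight may not exceed (its thread count) / p"; the function name `is_feasible` and the list-based interface are hypothetical, and the comparison is cross-multiplied to stay in integer arithmetic.

```python
def is_feasible(weights, threads, p):
    """Check w_i / sum(w) <= tau_i / p for every resource principal.

    weights: weight of each principal.
    threads: number of threads (tau_i) of each principal.
    p: number of processors.
    """
    total = sum(weights)
    # Cross-multiplied form of the inequality avoids fractions.
    return all(w * p <= tau * total for w, tau in zip(weights, threads))
```

On two processors, weights 10:1:1 are infeasible when every principal is single-threaded, but become feasible if the first principal has two threads, since it can then occupy both processors at once.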

SMP-based time-sharing schedulers employed by 
conventional operating systems take caching effects into 
account while scheduling threads [25]. As explained in 
Section 2.4, SFS can be modified to take such proces- 
sor affinities into account while making scheduling deci- 
sions. However, the implications of doing so on fairness 
guarantees and cache performance need further investi- 
gation. 

Regardless of whether resources are allocated in rela- 
tive or absolute terms, a predictable scheduler will need 
to employ techniques to restrict the number of threads 
in the system in order to provide performance guaran- 
tees. While some schedulers integrate an admission con- 
trol test with the scheduling algorithm, others implicitly 
assume that such an admission control test will be em- 
ployed but do not specify a particular test. SFS falls 
into the latter category—the system will need to em- 
ploy admission control if threads desire specific perfor- 
mance guarantees. Assuming such a test is employed, 
fair proportional-share schedulers have been shown to 
provide bounds on the throughput received and the la- 
tency incurred by threads [4, 9]. We are currently an- 
alyzing SFS to determine the performance guarantees 
that can be provided to a thread. Note, however, that 
the scheduling heuristic and the processor affinity bias 
can weaken the guarantees provided by SFS. 

Finally, proportional-share schedulers such as SFS 
need to be combined with tools that enable a user to 
determine an application’s resource requirements. Such 
tools should, for instance, allow a user to determine the 
processing requirements of an application (for instance, 
by application profiling), translate these requirements to 
appropriate weights, and modify weights dynamically if 
these resource requirements change [6, 21]. Translating 
application requirements such as rate to an appropriate 
set of weights is the subject of future research. 


6 Concluding Remarks 


In this paper, we argued that the infeasibility of cer- 
tain weight assignments causes unfairness or starvation 
in many existing proportional-share schedulers when 
employed for multiprocessor servers. We presented a 
novel weight readjustment algorithm to translate infea- 
sible weight assignments to a feasible set of weights. We 
showed that our algorithm enables existing proportional- 
share schedulers such as SFQ to significantly reduce, but 
not eliminate, the unfairness in their allocations. We 
then presented the idealized generalized multiprocessor 
sharing algorithm and derived surplus fair scheduling, 
which is a practical instantiation of GMS. We implemented
SFS in the Linux kernel and demonstrated its
efficacy through an experimental evaluation. Our exper- 
iments indicate that a proportional-share CPU scheduler 
such as SFS is not only practical but also desirable for 
general-purpose operating systems. As part of future 
work, we plan to extend SFS to do hierarchical schedul- 
ing as well as enhance proportional-share schedulers to 
account for priorities and delay.


Abstract


This work is focused on processor allocation in shared- 
memory multiprocessor systems, where no knowledge 
of the application is available when applications are sub- 
mitted. We perform the processor allocation taking into 
account the characteristics of the application measured 
at run-time. We want to demonstrate the importance of 
an accurate performance analysis and the criteria used to 
distribute the processors. With this aim, we present the 
SelfAnalyzer, an approach to dynamically analyzing the 
performance of applications (speedup, efficiency and 
execution time), and the Performance-Driven Processor 
Allocation (PDPA), a new scheduling policy that distributes
processors considering both the global conditions of the
system and the particular characteristics of running
applications. This work also defends the importance of the
interaction between the medium-term and the long-term
scheduler to control the multiprogramming level in the case
of the clairvoyant scheduling policies¹. We have implemented
our proposal in an SGI
Origin2000 with 64 processors and we have compared 
its performance with that of some scheduling policies 
proposed so far and with the native IRIX scheduling 
policy. Results show that the combination of the SelfAn- 
alyzer+PDPA with the medium/long-term scheduling 
interaction outperforms the rest of the scheduling poli- 
cies evaluated. The evaluation shows that in workloads 
where a simple equipartition performs well, the PDPA 
also performs well, and in extreme workloads where all 
the applications have a bad performance, our proposal 
can achieve a speedup of 3.9 with respect to an equipartition
and 11.8 with respect to the native IRIX scheduling policy.


1 Introduction 


The performance of current shared-memory multiprocessor
systems heavily depends on the allocation of processors to
parallel applications. This is especially important in NUMA
systems, such as the SGI Origin2000 [SGI98]. This work
attacks the problem of processor allocation in an execution
environment where no knowledge of the application is
available when applications are submitted.

1. Those scheduling policies that consider the application
characteristics.


Many researchers have considered the use of application
characteristics in processor scheduling [Brecht96]
[Chiang94][Marsh91][Nguyen96][NguyenZV96][Parsons96]. In
these works, parallel applications are characterized by
different parameters such as the maximum speedup, the
average parallelism, or the size of the working set.
Performing the processor allocation without taking into
account these characteristics can result in poor utilization
of the machine. For instance, allocating a high number of
processors to a parallel application with small speedup will
result in a loss of processor performance.


Traditionally, characteristics of parallel applications
were calculated in two different ways. In the first
approach, the user or system administrator performs several
executions under different scenarios, such as different
input data or numbers of processors, and collects several
measurements. A second approach, used in research
environments [Brecht96] [Chiang94] [Helmbold90]
[Leutenegger90] [Madhukar95] [Parsons96] [Sevcik94], defines
a job model, characterizing the applications by a set of
parameters, such as the average parallelism or the speedup.
This information is provided to the OS as an a priori input,
to be taken into account in subsequent executions.


This approach has several drawbacks. First of all, these
tests can be very time-consuming, or even prohibitive,
because of the number of combinations. Furthermore, the
performance of the application often depends on the
particular input data (data size, number of iterations).
Second, the behavior of the applications is influenced by
issues such as the characteristics of the processors
assigned to them, the run-time mapping of processes to
processors, or the memory placement. These issues determine
the performance of the applications and are only available
at run-time. Finally, the different analytic models proposed
so far are not able to represent the behavior of the
application at run-time. Moreover, analytic models try to
characterize the application when it is individually
executed, not in a shared environment. Most of the previous
approaches are based on analytic models.


On the other hand, the typical way to execute a parallel
application in production systems is through a long-term
scheduler, i.e. a queueing system [Feitelson95]. The
queueing system manages the number of applications that are
executed simultaneously, usually known as the
multiprogramming level¹. In this execution environment, jobs
are queued until the queueing system decides to execute
them. This work is based on execution environments where the
arrival of applications is controlled by a long-term
scheduler.


This work relies on characteristics of the applications
that are calculated at run time, and on using this
information for processor scheduling. In particular, we
propose to use the speedup and the execution time with P
processors. This work focuses on demonstrating how much the
performance achieved by parallel applications depends on
three factors: the accuracy of the measurements of the
application characteristics, the criteria used to perform
the processor scheduling, and the coordination with the
queueing system. With this aim, we present: (1) a new
approach to measure the speedup and the execution time of
parallel applications, the SelfAnalyzer, (2) a new
scheduling policy that uses the speedup and the execution
time to distribute processors, the Performance-Driven
Processor Allocation (PDPA) policy, and (3) a new approach
to coordinating the (medium-term) scheduler with the
queueing system (long-term scheduler).


Our approach has been implemented on an Origin2000 with 64
processors. Applications from the SPECfp95 benchmark suite
and from the NAS benchmarks have been used to evaluate the
performance of our proposal. All the benchmarks used in the
evaluation are parallelized with OpenMP [OpenMP2000]
directives. Finally, in the current implementation we
assume that applications are malleable [Feitelson97], that
is, that they can adjust to changing allocations at run
time.


1. In our environment, the multiprogramming level is
normally set to allow the simultaneous execution of a small
number of applications (two, three or four).


Results show that the combination of the SelfAnalyzer+PDPA
with the medium/long-term scheduling interaction
outperforms the rest of the scheduling policies evaluated.
The evaluation shows that, in workloads where a simple
equipartition performs well, the PDPA also performs well,
and that, in extreme workloads where all the applications
perform badly, our proposal can achieve a speedup of 3.9
with respect to an equipartition and 11.8 with respect to
the native IRIX scheduling.


The remainder of this paper is organized as follows:
Section 2 presents the related work. Section 3 presents
the PDPA scheduling policy. Section 4 presents the
execution environment in which we have developed this
work. Section 5 presents the evaluation of the PDPA
compared to some scheduling policies proposed so far and to
the IRIX scheduling policy. Finally, Section 6 presents the
conclusions of this work.


2 Related Work 


Many researchers have studied the use of application
characteristics calculated at run time to perform processor
scheduling. Majumdar et al. [Majumdar91], Parsons et al.
[Parsons96], Sevcik [Sevcik94][Sevcik89], Chiang et al.
[Chiang94] and Leutenegger et al. [Leutenegger90] have
studied the usefulness of application characteristics in
processor scheduling. They have demonstrated that parallel
applications have very different characteristics, such as
the speedup or the average parallelism, that must be taken
into account by the scheduler. All these works have been
carried out using simulations, not through the execution of
real applications, and assuming a priori information.


Some researchers propose that applications should monitor
themselves and tune their parallelism based on their
performance. Voss et al. [Voss99] propose to dynamically
detect parallel loops dominated by overheads and to
serialize them. Nguyen et al. [Nguyen96][NguyenZV96]
propose SelfTuning, which dynamically measures the
efficiency achieved in iterative parallel regions and
selects the best number of processors to execute them
according to that efficiency. These works have demonstrated
the effectiveness of using run-time information.


Other authors propose to communicate these application
characteristics to the scheduler and let it perform the
processor allocation using this information. Hamidzadeh
[Hamidzadeh94] proposes to dynamically optimize the
processor allocation by dedicating a processor to search
for the optimal allocation. This proposal does not consider
characteristics of the applications, only the system
performance. Nguyen et al. [Nguyen96][NguyenZV96] also use
the efficiency of the applications, calculated at run time,
to achieve an equal efficiency in all the processors.
Brecht et al. [Brecht96] use parallel program
characteristics in dynamic processor allocation policies
(assuming a priori information). McCann et al. [McCann93]
propose a scheduler that dynamically adjusts the number of
processors allocated to the parallel applications to
improve the processor utilization. Their approach considers
the idleness reported by the applications to allocate the
processors, which results in a large number of
re-allocations.


To obtain application characteristics, previous systems
have taken approaches such as using the hardware counters
provided by the architecture, or monitoring the execution
time of the different phases of the applications. Weissman
[Weissman98] uses the performance counters provided by
modern architectures to improve the thread locality. McCann
et al. [McCann93] monitor the idle time consumed by the
processors. Nguyen et al. [Nguyen96][NguyenZV96] combined
both: the use of hardware counters and the measurement of
the idle periods of the applications.


The most studied characteristic of parallel applications 
has been the speedup. Several theoretical studies have 
analyzed the relation between the speedup and other 
characteristics such as the efficiency. Eager, Zahorjan 
and Lazowska define in [Eager89] the speedup and the 
efficiency. Speedup is defined for each number of pro- 
cessors P as the ratio between the execution time with 
one processor and with P processors. Efficiency is 
defined as the average utilization of the P allocated
processors. The relationship between efficiency and
speedup is shown in Figure 1.


\[
S(P) = \frac{T(1)}{T(P)}, \qquad E(P) = \frac{S(P)}{P}
\]

where T(P) is the execution time with P processors.


Figure 1: Speedup and efficiency definitions 
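

As an illustration only, the definitions in Figure 1 can be written as the following Python sketch. The function names and the sample timings are hypothetical and do not come from the paper.

```python
def speedup(time_one_proc: float, time_p_procs: float) -> float:
    """S(P): execution time with one processor divided by the time with P."""
    if time_one_proc <= 0 or time_p_procs <= 0:
        raise ValueError("execution times must be positive")
    return time_one_proc / time_p_procs


def efficiency(speedup_p: float, procs: int) -> float:
    """E(P): speedup with P processors divided by P."""
    if procs < 1:
        raise ValueError("procs must be at least 1")
    return speedup_p / procs


# Hypothetical timings: 120 s on one processor, 20 s on eight processors.
s8 = speedup(120.0, 20.0)
e8 = efficiency(s8, 8)
print(f"S(8) = {s8}, E(8) = {e8}")
```

An efficiency above 1.0 under these definitions corresponds to the super-linear speedup discussed in the text.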


Helmbold et al. analyze in [Helmbold90] the causes of
loss of speedup and demonstrate that super-linear
speedup arises mainly from memory cache effects.


Our work differs from the previously mentioned proposals in
several ways. First of all, with respect to the parameters
used by the scheduling policy, our proposal considers two
characteristics of the applications: the speedup and the
execution time. We also propose to consider the variation
in these characteristics in proportion to the variation in
the number of allocated processors. Second, we differ in
the way the application characteristics are acquired. We
believe that parameters such as the speedup can only be
calculated accurately as the relation between two
measurements, as opposed to [Nguyen96]. Furthermore, since
the execution time of the applications is used by the
scheduler, we propose a new approach to estimate the
execution time of the whole application. Our measurements
are based on time, not on hardware performance counters, so
our method is independent of the architecture. Third, we
have implemented and evaluated our proposal using real
applications and a real architecture, the Origin2000.
Simulations do not consider important architectural issues
such as data locality. Finally, we consider the benefit
provided by the interaction of the (medium-term) scheduler
with the long-term scheduler (queueing system).


3 Performance-Driven Processor Allocation 


This section presents the three components of this work.
Figure 2 shows the general overview of our execution
environment. (1) Parallel applications calculate their
performance through the SelfAnalyzer, which informs the
scheduler about the speedup achieved with the current
number of processors, the estimated execution time of the
whole application, and the requested number of processors.
(2) Periodically (at each quantum¹ expiration) the
scheduler wakes up and applies the scheduling policy, the
PDPA. The PDPA distributes the processors among the
parallel applications considering their characteristics,
the global system status (such as the number of processors
allocated in the previous quantum), and the number of
processors requested by each application. Once the
processor allocation has been decided, the scheduler
enforces it by suspending or resuming the application’s
processes. The scheduler informs the applications about the
number of processors assigned to each one, and the
applications are in charge of adapting their parallelism to
their current allocation. In our work, the scheduler is a
user-level application, and it must enforce the processor
allocation through native operating system calls such as
suspend or resume. (3) Finally, the scheduler interacts
with the queueing system to dynamically modify the
multiprogramming level. The result is a multiprogramming
level adapted to the particular characteristics of the
running applications.


1. A typical quantum value is 100 ms.


Figure 2: General overview


3.1 Dynamic Performance Analysis: SelfAnalyzer


The SelfAnalyzer [Corbalan99] is a run-time library that
dynamically calculates the speedup achieved by the parallel
regions and estimates the execution time of the whole
application. The SelfAnalyzer exploits the iterative
structure of a significant number of scientific
applications. The main time-consuming code of these
applications is composed of a set of parallel loops inside
a sequential loop, and the iterations of the sequential
loop behave similarly to one another. Measurements for a
particular iteration can therefore be used to predict the
behavior of the next iterations, a property also exploited
in [Nguyen96].


We believe that the speedup should be calculated as the
relationship between two measurements: the sequential or
reference execution time and the parallel execution time.
In [Corbalan99] we demonstrated that a speedup calculated
as a function of only one measurement cannot detect
significant issues such as super-linear speedups. Figure 3
shows the formulation used by the SelfAnalyzer to calculate
the speedup and to estimate the execution time.


To calculate the speedup, the SelfAnalyzer measures the
execution time of each outer sequential iteration and also
monitors the sequential and parallel regions inside the
outer loop. It executes some initial iterations of the
sequential loop with a predefined number of processors
(baseline), to be used as the reference for the speedup
computation. Once T(baseline) is computed, (1) in Figure 3,
the application goes on measuring the execution time, but
with the number of processors allocated by the scheduler.
If baseline is one processor, the calculated speedup
corresponds to the traditional speedup measurement. Since
the execution of some initial iterations with one processor
could consume a lot of time, we propose to set the baseline
to more than one processor. In [Corbalan99] we demonstrate
that setting baseline to four processors is a good
trade-off between the information provided by the
measurement and the overhead introduced by executing the
first iterations with a small number of processors.
However, this approach has the drawback that it does not
allow us to directly compare speedups among applications.
With baseline set to four processors, the speedup with four
processors of an application that scales well will be one,
and the speedup with four processors of an application that
scales poorly will also be one.


We use Amdahl’s law [Amdahl67] to normalize the speedups
inside an application. Amdahl’s law bounds the speedup that
an application can achieve with P processors based on the
fraction of sequential code.


We call this function the Amdahl’s Factor (AF), see (2) in
Figure 3. In this way, we calculate the AF of the baseline
and use this value to normalize the speedups calculated by
the SelfAnalyzer.


\begin{align}
S(p) &= \frac{T(\mathit{baseline})}{T(p)} \times AF(\mathit{Baseline}) \tag{1} \\
AF(\mathit{Baseline}) &= \frac{1}{\mathit{SeqFraction} + \dfrac{1 - \mathit{SeqFraction}}{\mathit{Baseline}}} \tag{2} \\
\mathit{ExTime}(p) &= \mathit{ConsumedTime} + \left( \frac{AF(\mathit{Baseline}) \times T(\mathit{baseline})}{S(p)} \right) \times \mathit{ItersRemaining} \tag{3}
\end{align}

where SeqFraction denotes the fraction of sequential code,
following the standard form of Amdahl’s law.

Figure 3: Calculation of the speedup and execution time estimation


Figure 4: PDPA: Application state diagram


Considering the characteristics of these parallel
applications, and taking into account their iterative
structure, we are able to estimate the complete execution
time of the application by using the calculated speedup and
the number of iterations that the application executes, (3)
in Figure 3. This estimate is calculated by adding the
execution time consumed so far to the estimate of the
remaining execution time. The remaining execution time is
calculated as a function of the number of iterations not
yet executed and the speedup that the application is
achieving on each iteration.
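

The formulation of Figure 3 can be sketched in Python as follows. This is a minimal sketch, assuming that the Amdahl's Factor takes the standard form of Amdahl's law; all function and parameter names, and the sample values, are hypothetical and are not the SelfAnalyzer's actual code.

```python
def amdahl_factor(seq_fraction: float, procs: int) -> float:
    """Amdahl's law bound on the speedup with `procs` processors.

    ASSUMPTION: standard form 1 / (f + (1 - f) / P), where f is the
    fraction of sequential code.
    """
    if not 0.0 <= seq_fraction <= 1.0:
        raise ValueError("seq_fraction must be in [0, 1]")
    if procs < 1:
        raise ValueError("procs must be at least 1")
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / procs)


def normalized_speedup(t_baseline: float, t_p: float, af_baseline: float) -> float:
    """Equation (1): per-iteration time ratio scaled by AF(Baseline)."""
    return (t_baseline / t_p) * af_baseline


def estimated_time(consumed: float, af_baseline: float, t_baseline: float,
                   speedup_p: float, iters_remaining: int) -> float:
    """Equation (3): time consumed so far plus the predicted remaining time."""
    time_per_iteration = af_baseline * t_baseline / speedup_p
    return consumed + time_per_iteration * iters_remaining


# Hypothetical run: fully parallel code, baseline of 4 processors,
# 2.0 s per iteration at baseline and 1.0 s per iteration with 8 processors.
af = amdahl_factor(0.0, 4)
s = normalized_speedup(2.0, 1.0, af)
total = estimated_time(10.0, af, 2.0, s, 50)
print(af, s, total)
```

Note that the term inside the parentheses of (3) reduces to T(p), the measured time of one iteration with the current allocation, which is why the estimate is simply the consumed time plus T(p) times the remaining iterations.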


To calculate the speedup and the execution time, the
SelfAnalyzer needs to detect the following instrumentation
points in the code: the start of the application, the
iterative structure, and the start and end of each parallel
loop. In the current implementation, the SelfAnalyzer can
be invoked at these points in two different ways: (1) if
the source code is available, the application can be
re-compiled and the SelfAnalyzer calls can be inserted by
the compiler; (2) if the source code is not available, both
the iterative structure and the parallel loops are detected
dynamically.


When the source code is not available, we detect the
instrumentation points using dynamic interposition
[Serra2000]. Calls to parallel loops are identified by the
address of the function that encapsulates the loop. This
sequence of values (addresses) is passed to another
mechanism that dynamically detects periodic patterns: it
receives as input a dynamic sequence of values and
determines whether they follow a periodic pattern. Once we
detect the iterative parallel region, the performance
analysis is started.
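

The paper does not describe how the periodic pattern detector works internally. The following Python sketch shows one simple way such a detector could operate on a sequence of loop addresses; the function name, the `min_repeats` parameter, and the sample addresses are assumptions made for illustration.

```python
from typing import Optional, Sequence


def detect_period(values: Sequence[int], min_repeats: int = 2) -> Optional[int]:
    """Return the smallest period p such that the last p * min_repeats
    values repeat with period p, or None if no such period exists.

    Only the tail of the sequence is inspected, so start-up calls that
    precede the iterative region do not prevent detection.
    """
    n = len(values)
    for period in range(1, n // min_repeats + 1):
        window = period * min_repeats
        tail = values[n - window:]
        if all(tail[i] == tail[i + period] for i in range(window - period)):
            return period
    return None


# Hypothetical addresses of the functions that encapsulate parallel loops:
# one start-up call followed by two iterations of a three-loop body.
calls = [0x99, 0x10, 0x20, 0x30, 0x10, 0x20, 0x30]
print(detect_period(calls))
```

With the period known, each repetition of the pattern can be treated as one iteration of the outer sequential loop.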


In this case the number of times that the iterative
structure executes is not available. The SelfAnalyzer is
then unable to estimate the execution time of the
application, and it assumes that the most useful
characteristic for the scheduler is the execution time of
one outer iteration.


As far as the status of the performance calculation is
concerned, applications can be internally in one of two
states: Performance Not yet Calculated (PNC) or Performance
Calculated (PC). The application is in the PNC state when
the speedup with the current number of assigned processors
has not yet been calculated, and in the PC state when it
has been calculated. At the start of the application, and
each time the processor allocation is changed, the
application is in the PNC state. If the processor
allocation is modified while the application is in the PNC
state, the current calculations (speedup and execution
time) are discarded, and a new calculation with the current
number of processors is started.


3.2 The Performance-Driven Processor 
Allocation: PDPA 


The PDPA allocates processors among the applications 
considering issues such as the number of processors 
used in the system, the speedup achieved by each appli- 
cation, and the estimation of the execution time of the 
whole application. The goal of the PDPA is to minimize 
the response time, while guaranteeing that the allocated 
processors are achieving a good efficiency. 


The PDPA considers each application to be in one of the 
states shown in Figure 4. These states correspond with 
trends of the performance of the application. These 
states and the transitions among them are determined 
both by the performance achieved by the application and 
by some policy parameters. The PDPA parameters are the
target efficiency (high_eff), the minimum efficiency
considered acceptable (low_eff), and the number of
processors by which the allocation of an application is
incremented or decremented (step). Section 3.2.2 presents
how the current approach defines these parameters.


3.2.1 Application state diagram 


The PDPA can assign four different states to applications:
NO_REF (initial state), DEC, INC, and STABLE (see Figure
4). Each quantum, the PDPA processes the performance
information provided by the applications, compares it with
the performance achieved in the previous quantum and with
the policy parameters, and decides the application state
for the next quantum. The state transitions determine the
processor allocation for this application in the next
quantum, even if the next state is the same.


All the applications start in the NO_REF state. This state
means that the PDPA has no performance knowledge about the
application (at the starting point). The processor
allocation associated with the start of a new application
is the same as in an equipartition (approximately
total_processors_machine/total_applications) if there are
enough free processors; otherwise the PDPA assigns the
available free processors. Once the PDPA is informed about
the speedup achieved with the previous allocation, it
compares the efficiency¹ with high_eff and low_eff. If the
efficiency is greater than high_eff, the PDPA considers
that the application performs well and sets the next state
to INC. If it is lower than low_eff, the PDPA considers
that the application performs poorly and sets the next
state to DEC. Otherwise, the PDPA considers that the
application has an acceptable performance that does not
justify a change, and sets the next state to STABLE.


If the next state is INC, the application will receive in
the next quantum the current number of allocated processors
plus step. If the next state is DEC, the application will
receive in the next quantum the current number of allocated
processors minus step. If the next state is STABLE, the
processor allocation will be maintained.
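

The transition out of NO_REF and the resulting allocation can be sketched in Python as below. This is an illustrative sketch, not the CpuManager code: the function names are hypothetical, and the lower bound of one processor is taken from the implementation notes in Section 3.2.3.

```python
def next_state_from_no_ref(efficiency: float, high_eff: float, low_eff: float) -> str:
    """Classify an application once its first speedup is known."""
    if efficiency > high_eff:
        return "INC"      # performs well: try more processors
    if efficiency < low_eff:
        return "DEC"      # performs poorly: take processors away
    return "STABLE"       # acceptable: keep the allocation


def next_allocation(current: int, state: str, step: int, min_procs: int = 1) -> int:
    """Allocation for the next quantum given the next state."""
    if state == "INC":
        return current + step
    if state == "DEC":
        # Never go below the minimum of one processor (Section 3.2.3).
        return max(min_procs, current - step)
    return current


# Hypothetical application holding 16 processors with efficiency 0.95.
state = next_state_from_no_ref(0.95, high_eff=0.9, low_eff=0.7)
print(state, next_allocation(16, state, step=4))
```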


1. Calculated as the ratio between the speedup with P
processors and P.


The INC state means that the application has performed well
until the current quantum. In this state the PDPA uses both
the speedup and the estimation of the execution time to
decide the next state. The MoreProcessors() algorithm
presented in Figure 5 is executed to determine the next
state. MoreProcessors() returning TRUE means that the
additional processors associated with the transition to
this state have provided a “real benefit” to the
application. In that case the next state is set to INC.
MoreProcessors() returning FALSE means that the additional
processors were not useful to the application. In that case
the next state is set to STABLE. If the next state is INC,
the application will receive step additional processors in
the next quantum. If the next state is STABLE, the
application will lose the step additional processors
received in the last transition.


The DEC state means that the application has performed
badly until the current quantum. The LessProcessors()
algorithm presented in Figure 5 is executed to determine
the next state. LessProcessors() returning TRUE means that
the application has not yet achieved an acceptable
performance. In that case the next state will be DEC.
LessProcessors() returning FALSE means that the performance
is currently acceptable and the next state must be STABLE.
If the next state is DEC, the application will lose step
more processors in the next quantum. If the next state is
STABLE, the application will retain the current allocation.


The STABLE state means that the application has the maximum
number of processors that the PDPA considers acceptable.
Typically, once an application becomes STABLE it remains
STABLE until it finishes. The allocation in this state is
maintained. Only if the policy parameters are defined
dynamically might the PDPA change the state of an
application from STABLE to either INC or DEC. If low_eff
has been increased and the efficiency achieved with the
current allocation is no longer acceptable, the next state
will be DEC and the application will lose step processors.
In a symmetric way, if high_eff has been decreased, the
next state will be INC and the application will receive
step additional processors.


MoreProcessors()
{
    RelativeSpeedup = ExTime(LastAllocation) / ExTime(current)
    IncrementProcessors = current / LastAllocation
    if (Efficiency(current) >= high_eff &&
        Speedup(current) > Speedup(LastAllocation) &&
        RelativeSpeedup >= (IncrementProcessors * high_eff)) return TRUE
    else return FALSE
}

LessProcessors()
{
    if (Efficiency(current) < low_eff) return TRUE
    else return FALSE
}


Figure 5: Algorithms to determine if the application achieves a good or bad performance
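

The two tests of Figure 5 translate almost directly into Python. The sketch below is illustrative: the `Sample` record and its field names are hypothetical containers for the values that the SelfAnalyzer reports for one allocation, and the numbers in the example are invented.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Performance reported for one processor allocation (hypothetical type)."""
    procs: int        # number of allocated processors
    speedup: float    # speedup measured with that allocation
    ex_time: float    # estimated execution time of the whole application

    @property
    def efficiency(self) -> float:
        return self.speedup / self.procs


def more_processors(current: Sample, last: Sample, high_eff: float) -> bool:
    """Figure 5, MoreProcessors(): did the extra processors give a real benefit?"""
    relative_speedup = last.ex_time / current.ex_time
    increment_processors = current.procs / last.procs
    return (current.efficiency >= high_eff
            and current.speedup > last.speedup
            and relative_speedup >= increment_processors * high_eff)


def less_processors(current: Sample, low_eff: float) -> bool:
    """Figure 5, LessProcessors(): is the efficiency still unacceptable?"""
    return current.efficiency < low_eff


# Hypothetical application that grew from 8 to 12 processors.
last = Sample(procs=8, speedup=7.6, ex_time=100.0)
current = Sample(procs=12, speedup=11.4, ex_time=68.0)
print(more_processors(current, last, high_eff=0.9))   # stays in INC if True
```

The third condition in `more_processors` ties the gain in estimated execution time to the relative growth of the allocation, so an application whose speedup improves only marginally does not keep accumulating processors.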


3.2.2 PDPA parameters 


As mentioned above, three parameters determine the
“aggressiveness” of the PDPA. These parameters can be
defined either statically, for instance by the system
administrator, or dynamically, for instance as a function
of the number of running applications.


In the current PDPA implementation, high_eff and low_eff
are dynamically defined and step is statically defined. The
PDPA calculates the values of high_eff and low_eff at the
start of each quantum, before processing the applications.
The value of high_eff is calculated as a function of the
ratio between the total number of processors allocated in
the last quantum and the number of processors in the
system. We have adopted this solution because this ratio is
a good hint about the level of scalability that the PDPA
must require of parallel applications before allocating
them more processors. The higher this ratio is, the higher
the high_eff value will be. Experimentally, the high_eff
values range from 1.0 (ratio>0.9) to 0.7 (ratio<0.75). The
value of low_eff is defined as a function of high_eff. In
the current implementation it is set to the value of
high_eff minus 0.2.
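

The text gives only the two end points of the high_eff range. The Python sketch below fills the interval between them with a linear interpolation, which is purely an assumption made for illustration; the function and parameter names are hypothetical.

```python
from typing import Tuple


def policy_thresholds(allocated_last_quantum: int, total_procs: int) -> Tuple[float, float]:
    """Return (high_eff, low_eff) for the coming quantum."""
    ratio = allocated_last_quantum / total_procs
    if ratio > 0.9:
        high_eff = 1.0
    elif ratio < 0.75:
        high_eff = 0.7
    else:
        # ASSUMPTION: the paper does not state the values between the two
        # end points; a linear interpolation is used here as a placeholder.
        high_eff = 0.7 + (ratio - 0.75) * (1.0 - 0.7) / (0.9 - 0.75)
    low_eff = high_eff - 0.2   # as stated for the current implementation
    return high_eff, low_eff


# Hypothetical 64-processor machine with 60 processors allocated last quantum.
print(policy_thresholds(60, 64))
```

A heavily loaded machine therefore demands near-perfect efficiency before an application is given more processors, while a lightly loaded one is more permissive.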


Step is a parameter that sets the increments or decrements
in the allocation of an application. This parameter is used
to limit the number of re-allocations suffered by the
applications. Setting step to a small value gives more
accuracy in the number of allocated processors, but the
overhead introduced by the re-allocations can be
significant. In the current implementation, this parameter
has been tuned empirically and set to four processors.


3.2.3 Implementation issues 


The PDPA checks the internal status of the applications and
maintains the processor allocation of those applications
that are in the PNC state. Transitions in the state diagram
are only allowed either when all the applications are in
the PC state or when there are unallocated processors. The
aim of this decision is to maintain the allocation of those
applications that are calculating their speedup. If we
modify the speedup of an application in PNC state as a
consequence of the processing of another application, it
could result in inaccurate allocations.


The PDPA allocates a minimum of one processor to those
applications that are in the PC state. This decision has
been taken considering that the efficiency of an
application with one processor is 1.0. This assumption is
also made in scheduling policies such as the equipartition
and the equal_eff. Moreover, it simplifies the
implementation of the SelfAnalyzer and the PDPA.


Applications in PC state are sorted by speedup. This
ordering gives a certain priority to those applications
that perform better, ensuring that they will receive
processors. Finally, the PDPA maintains the history of the
application states, and does not allow applications to
change from STABLE to either DEC or INC more than three
times. The number of transitions is limited to avoid an
excessive number of re-allocations, which would cause a
loss of performance. This limit has been tuned empirically
considering the particular characteristics of the workloads
used. Further research with different workloads and
applications will allow us to tune this parameter.


3.2.4 Interface 


Table 1 shows the main primitives of the interface between
the parallel library and the scheduler, and between the
SelfAnalyzer and the scheduler. The first four rows are
used by the parallel library to interact with the
scheduler: requesting cpus, checking the number of
allocated cpus, checking whether there are preempted
threads, and recovering them. These are the main functions
to implement the dynamic processor allocation mechanism.
The last two primitives are used by the SelfAnalyzer to
inform the scheduler about the calculated speedup and the
estimation of the execution time of the application.


Table 1: Interface

Function                                      Description
int cpus_request(int P)                       Requests P cpus from the scheduler
int cpus_current()                            Returns the number of cpus allocated to the application
int cpus_preempted_work()                     Returns the number of preempted threads
work_t get_preempted_work()                   Returns the pointer to the first preempted thread
int cpus_speedup(int P, double speedup)       Sets the speedup achieved when P cpus are allocated to the application
int cpus_predicted_time(int P, double time)   Sets the execution time estimated when P cpus are allocated to the application


3.3 Queueing system coordination:
Dynamic multiprogramming level 


As mentioned above, the multiprogramming level defines the
number of applications running concurrently in the system.
Non-clairvoyant scheduling policies typically allocate as
many processors as possible to the running applications,
since they are not able to determine how the applications
will perform. They assign the minimum of the total
requested number of processors and the number of processors
of the machine.


However, even when the total requested number of processors
is greater than the total number of processors in the
machine, the PDPA may decide to leave some processors
unallocated. In that case, the logical approach is to allow
the queueing system to start a queued application. We
propose to check the scenario conditions after each
re-allocation and to decide whether a new application can
be started. The conditions that must be met are the
following:


• Are there free processors?

• Are all the running applications in the STABLE or DEC
  states?

• Even if some applications are in the INC state, does the
  number of unused processors reach a certain percentage?
  (currently set by the administrator to 20%)

These conditions are checked in the NewAppl() function call
implemented by the scheduler and consulted by the queueing
system.
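

One plausible reading of how the three conditions combine is sketched below in Python: free processors are always required, and then either every running application is settled or the share of unused processors reaches the threshold. This combination, the function signature, and the parameter names are assumptions; the paper only names the NewAppl() call.

```python
from typing import Iterable


def new_appl(free_procs: int, total_procs: int, app_states: Iterable[str],
             unused_threshold: float = 0.20) -> bool:
    """Decide whether the queueing system may start a queued application."""
    if free_procs <= 0:
        return False                      # condition 1: free processors exist
    if all(state in ("STABLE", "DEC") for state in app_states):
        return True                       # condition 2: nobody is still growing
    # Condition 3: some application may still be in INC, but enough
    # processors are unused (20% by default, set by the administrator).
    return free_procs / total_procs >= unused_threshold


# Hypothetical 64-processor machine with 16 free processors.
print(new_appl(16, 64, ["INC", "STABLE"]))
```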


4 Execution environment and implementation


The work presented in this paper has been developed using
the NANOS execution environment: the NanosCompiler, NthLib,
and the CpuManager (the medium-term scheduler).


Applications are parallelized through OpenMP directives.
They are compiled with the NanosCompiler [NANOS99], which
generates code for NthLib [Martorell95][Martorell96].
NthLib constructs the structure of parallelism specified in
the OpenMP directives and is able to adapt the structure of
the application to the number of available processors.
Moreover, it interacts with the CpuManager through a kernel
interface in the following way: NthLib informs the
scheduler about the number of requested processors, and the
scheduler informs NthLib about the number of processors
available to this application.


The CpuManager [CorbalanML99] is a user-level processor
scheduler. It implements the PDPA scheduling policy. It
follows the approach proposed in [Tucker89], which assumes
that applications perform better when the number of running
threads is the same as the number of processors.


For the following experiments, the CpuManager also
implements the queueing system, so in this particular
implementation the queueing system communicates with the
PDPA by calling it directly. The queueing system launches a
new application each time a running application finishes,
and every quantum it asks the PDPA whether a new
application can be started.


5 Evaluation 


In order to evaluate the practicality and the benefits of 
the PDPA we have executed several parallel workloads 
under different scenarios: 


Equip: Applications are compiled with the NanosCompiler and
linked with NthLib. The CpuManager is executed and it
applies the equipartition policy proposed in [McCann93].
Equipartition is a space sharing policy that, to the extent
possible, maintains an equal allocation of processors to
all jobs. The initial allocation is set to zero. Then, the
allocation of each job is increased by one in turn, and any
job whose allocation has reached its number of requested¹
processors drops out. This process continues until either
there are no remaining jobs or all P processors have been
allocated. The only information provided by the application
is its current processor requirements.
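

The round-robin procedure described for equipartition can be sketched in Python as follows. The function name and the sample requests are hypothetical; this illustrates the description above and is not the CpuManager implementation.

```python
from typing import List


def equipartition(requests: List[int], total_procs: int) -> List[int]:
    """Distribute processors one at a time, in turn, among the jobs."""
    alloc = [0] * len(requests)                    # initial allocation is zero
    active = [i for i, r in enumerate(requests) if r > 0]
    remaining = total_procs
    while active and remaining > 0:
        for job in list(active):                   # one processor per job, in turn
            if remaining == 0:
                break
            alloc[job] += 1
            remaining -= 1
            if alloc[job] == requests[job]:
                active.remove(job)                 # request satisfied: job drops out
    return alloc


# Hypothetical workload: two jobs asking for 32 processors and one asking
# for 2, on a machine with 16 processors.
print(equipartition([32, 32, 2], 16))
```

Processors that a small job does not need are redistributed among the others, which is what distinguishes this policy from a fixed division of the machine.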


1. Specified as a command line parameter of the application
or by setting an environment variable.


PDPA: Applications are compiled with the NanosCompiler and
linked with NthLib. The CpuManager applies the PDPA
scheduling policy. Three different variations have been
executed to demonstrate the usefulness of the different
components of our approach: (1) PDPA, as proposed in
Section 3. (2) PDPA(S), where the PDPA only considers the
speedup; the benefit in execution time provided by the
extra processor allocation is not considered. (3)
PDPA(idleness), where the speedup is calculated as a
function of efficiency. In this case, we have tried to
implement the approach proposed in [NguyenZV96], which
calculates the efficiency by measuring the sources of
overhead: idleness, processor stall time, and system
overhead. In our applications, we found the system overhead
to be negligible, and in current architectures, like the
Origin2000, the hardware does not provide the performance
counters needed to calculate the processor stall time. Due
to the difficulties of implementing their complete
approach, we have implemented a similar approach that only
considers idleness as a source of overhead.


Equal_eff: Applications are compiled with the NanosCompiler
and linked with NthLib. The CpuManager applies the
equal_eff policy proposed in [NguyenZV96]. The goal of the
equal_eff is to maximize the system efficiency. It uses the
dynamically calculated efficiency of the applications,
obtained through the SelfAnalyzer, to extrapolate
[Dowdy88] the complete efficiency curve. Once the curve is
extrapolated, the equal_eff works in the following way: it
initially assigns a single processor to each application,
and then it assigns the remaining processors one by one to
the application with the currently highest (extrapolated)
efficiency.
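

The greedy assignment used by equal_eff can be sketched in Python as below. The extrapolation of the efficiency curves [Dowdy88] is outside the scope of the sketch, so each curve is passed in as a caller-supplied function. Evaluating each curve at the allocation the application would have after receiving the next processor is an assumption, as are all names and the sample curves.

```python
from typing import Callable, List


def equal_eff(curves: List[Callable[[int], float]], total_procs: int) -> List[int]:
    """Give one processor to each application, then hand out the rest greedily."""
    if total_procs < len(curves):
        raise ValueError("not enough processors for one per application")
    alloc = [1] * len(curves)
    for _ in range(total_procs - len(curves)):
        # ASSUMPTION: compare the extrapolated efficiency each application
        # would have with one more processor; ties go to the first one.
        best = max(range(len(curves)), key=lambda i: curves[i](alloc[i] + 1))
        alloc[best] += 1
    return alloc


# Hypothetical curves: one application scales perfectly, the other gains
# nothing from extra processors.
scales_well = lambda p: 1.0
scales_badly = lambda p: 1.0 / p
print(equal_eff([scales_well, scales_badly], 8))
```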


SGI-MP: Applications are compiled with the MIPSpro 
F77 compiler and linked with the MP-library. The com- 
mercial IRIX scheduling policy has been used. In this 
case, the NANOS execution environment is not involved 
at all. The queueing system has been used to control the 
multiprogramming level. In this scenario, the environment
variables that define the application adaptability
have been set to the following values¹:
MP_BLOCKTIME=200000 and OMP_DYNAMIC=TRUE.


5.1 Architecture, applications and work- 
loads 
All the workloads have been executed on an Origin2000
[Laudon97][SGI98] with 64 processors. Each processor
is a MIPS R10000 [Yeager96] at 250 MHz, with two
separate L1 caches for instructions and data (32 Kbytes),
and a secondary unified instruction/data cache (4
Mbytes).


1. These values have been tuned empirically to perform
well under all the applications used in this work.


To evaluate our proposal we have selected four different 
applications: swim, hydro2d, apsi, and BT (class=A). 
The swim, hydro2d and apsi are applications from the 
SPECFp95, the BT is from the NASPB [Jin99]. Each 
one of them has different behavior considering the 
speedup. Table 2 presents the characteristics of these 
applications, from higher to lower speedup. Swim 
achieves a super-linear speedup, BT has a moderate- 
high speedup, hydro2d has low speedup and apsi has 
very bad speedup. In all the applications, except in apsi, 
the maximum speedup is achieved with 32 processors. 
The complete performance analysis of these applica- 
tions and their speedup curves can be found in 
[Corbalan99].


Compilation of benchmarks from the SPECFp has been 
done using the following command line options for the 
native MIPSpro f77 compiler: -64 -mips4 -r10000
-Ofast=ip27 -LNO:prefetch_ahead=1. Compilation of
the BT has been done using the Makefile provided with 
the NASPB distribution. 


Table 3 describes the four different workloads designed 
to evaluate the performance of the PDPA. The column 
instances is the number of times that the application is 
executed and the request column is the number of 
requested processors. 


Workload 1 is designed to evaluate the performance of
the PDPA when applications perform well, and the allo- 
cation of the equipartition policy directly achieves a 
good performance. Workload 2 has been designed to 
evaluate the PDPA performance when some of the appli- 
cations perform well and some perform badly. Workload 
3 evaluates the performance when applications have a 
medium and bad speedup and, finally, workload 4 evalu- 
ates the PDPA when all the applications have very bad 
performance. Since we are not assuming a priori knowledge
of the applications, we have set the requested number
of processors to 32 in all the applications.


Table 2: Parallel applications

Characteristic / Application (input)   swim (ref)       BT.A             hydro2d       apsi
Exec. time in sequential               212.2 sec.
Speedup with 8/16/32 processors        21.6/36.5/44.2   6.1/12.4/20.85   4.6/5.4/6.3   0.93/0.93/0.92

(A second sequential execution time, 1066.21 sec., survives without a recoverable column.)


Table 3: Workload description





the executions. The queueing system applies a First 
Come First Served policy, and we assume that all the 


applications have been queued at the same time¹.


The dynamic page migration mechanism of IRIX has 
been activated and we have checked that results are 
slightly better than without this mechanism. 


5.2 Results 


Figure 6 presents the average execution time per application
in the different scenarios for the four workloads.
We also show the total execution time of the workloads
under the different scheduling policies. Results from
workload 1 show that the PDPA-based scheduling policies
(PDPA and PDPA(S)) perform well, compared with
equipartition. The PDPA(idleness) does not perform
well, demonstrating the importance of an accurate estimation
of the performance. In this workload, the
equal_eff performs well since the applications can efficiently
use a large number of processors. We can also
appreciate the importance of considering the benefit
provided by the additional processors to the applications.
If we observe the average execution time of swim,
we see how the PDPA outperforms the PDPA(S). The
reason is that the PDPA(S) allocates more processors to
some instances of swim, allocating fewer processors to the
rest of the running applications. With the PDPA(S) the standard
deviation in the execution time of the different
instances is greater than in PDPA. The execution time
range is (6.5,14.6) in PDPA(S) and (6.5,8.5) in PDPA.
The importance of considering the benefit provided by
the additional processors is more significant when the
load of the system is high. In that case, without considering
this parameter the processor allocation can
become unfair. In the rest of the workloads the difference
between PDPA and PDPA(S) is less significant, since
the load of the system is low.


1. Instances from different applications have been merged
in the queue.


In workload 2, the PDPA-based scheduling policies
outperform the rest of the scheduling policies. In this case,
the workload execution time has been significantly
reduced because of the communication with the long-medium
term scheduler. The speedup with respect to
both the PDPA(idleness) and the SGI-MP is 3.2.


Figure 6: Per application (avg) and total execution time of the parallel workloads.


Workload 3 does not show large differences in the indi- 
vidual performance, although the number of processors 
allocated to applications by the PDPA-based scheduling 
policies is very small, allowing the long-term scheduler 
to start a new application, resulting in a better system 
utilization. This better utilization can be observed in the 
execution time of the workload. The PDPA-based 
scheduling policies achieve speedups from 2 (with 
respect to the equip.) to 6.2 (with respect to the SGI- 
MP). 


Finally, in workload 4, the PDPA-based scheduling poli- 
cies outperform the rest, mainly in the execution time of 
the workload, and also in the individual performance. 
Allocating a small, but sufficient, number of processors 
to the apsi avoids undesirable memory interferences. 
Considering the workload execution time, the PDPA- 
based scheduling policies achieve speedups from 2.0 
with respect to the equip. to 6.76 with respect to the 
SGI-MP. 


We want to comment on the performance achieved in
the case of the SGI-MP environment. The problem is the
large number of unnecessary context switches. These
context switches cause a loss of performance because
they imply reloading the data cache and remote memory
accesses, and they increase the system time consumed by the
application. For instance, consider one apsi execution in
workload 4 in the PDPA and in the SGI-MP environments.
In the PDPA the apsi has spent 0.1% of the
execution time in system mode (0.23 sec. in system
mode and 204 sec. in user mode). In the SGI-MP case,
the apsi has spent 27% in system mode (152 sec. in
system mode and 562.7 sec. in user mode).


Figure 7: Processor allocation (avg) of each application under the different scheduling policies
[Plots not reproduced. Recoverable panel titles: Workload 1, Workload 3. Recoverable legend entries: PDPA(S), Equal_eff, SGI-MP, PDPA(idleness). Vertical axis: Avg. CPUs allocated.]


Figure 7 shows the processor allocation made by the dif- 
ferent scheduling policies when executing the parallel 
workloads. Each column shows the average number of processors
allocated to each application. In these
graphs, we can observe how the scheduling policies that
take into account the application characteristics distribute
the processors according to the application performance.
Since there is a minimum of two instances
of each application running concurrently, the highest of
the columns should normally not exceed thirty-two processors
(sixteen in the case of workload 4).


We can observe how PDPA and PDPA(S) distribute the
processors proportionally to the application performance.
PDPA(S) is less restrictive and it assigns more
processors. On the other hand, equal_eff does not have a
rule to stop the processor allocation to the applications.
This is the reason why the equal_eff allocates a higher
number of processors to applications that perform badly,
like apsi. PDPA(idleness) is not able to detect the good
or bad behavior of the applications. The idleness is
shown to be a poor indicator of the real efficiency achieved by
the parallel applications. We can also observe, in the case
of the SGI-MP, how applications have adapted their parallelism
to the available processors, in a similar way to
equipartition.


6 Conclusions 


In this work, we have presented Performance-Driven 
Processor Allocation, a new scheduling policy that uses 
both global system information and the application 
characteristics provided by the SelfAnalyzer, a dynamic 
performance analyzer. PDPA allocates processors to
applications that will take advantage of them, avoiding
both unfair allocations, which give processors to applications
that do not benefit from them, and prejudicial allocations,
which increase the execution time.


This work has been implemented and evaluated on an 
SGI Origin2000. We have demonstrated that it is impor- 
tant for the scheduler to receive accurate information 
about the application characteristics. Our evaluation 
shows that PDPA outperforms the considered schedul- 
ing policies. 


Finally, in this work we have considered the usefulness 
of the interaction between the medium and the long- 
term scheduler. Our experience has shown that it is con- 
venient to allow this kind of communication to improve 
the performance of the global system. This conclusion is 
valid for PDPA and also for any scheduling policy that
allocates processors to applications based upon their 
performance.


Abstract


Pocket computers are beginning to emerge that provide 
sufficient processing capability and memory capacity to 
run traditional desktop applications and operating sys- 
tems on them. The increasing demand placed on these 
systems by software is competing against the continuing
trend in the design of low-power microprocessors towards
increasing the amount of computation per unit of
energy. Consequently, in spite of advances in low-power 
circuit design, the microprocessor is likely to continue 
to account for a significant portion of the overall power 
consumption of pocket computers. 


This paper investigates clock scaling algorithms on the 
Itsy, an experimental pocket computer that runs a com- 
plete, functional multitasking operating system (a ver- 
sion of Linux 2.0.30). We implemented a number of 
clock scaling algorithms that are used to adjust the pro- 
cessor speed to reduce the power used by the proces- 
sor. After testing these algorithms, we conclude that cur- 
rently proposed algorithms consistently fail to achieve 
their goal of saving power while not causing user appli- 
cations to change their interactive behavior. 


1 Introduction 


Dynamic clock frequency scaling and voltage scaling are 
two mechanisms that can reduce the power consumed by
a computer. Both voltage scaling and frequency scaling 
are important; the power consumed by a component im- 
plemented in CMOS varies linearly with frequency and 
quadratically with voltage. 


To evaluate the relative importance and the situations in
which either is useful, it is necessary to consider energy,
the integral of power over time. By reducing the fre- 
quency at which a component operates, a specific oper- 
ation will consume less power but may take longer to 
complete. Although reducing the frequency alone will 
reduce the average power used by a processor over that 
period of time, it may not deliver a reduction in energy 
consumption overall, because the power savings are lin- 
early dependent on the increased time. While greater 
energy reductions can be obtained with slower clocks 
and lower voltages, operations take longer; this exposes 
a fundamental tradeoff between energy and delay. 


Many systems allow the processor clock to be varied. 
More recently, there are a number of processors that 
allow the processor voltage to be changed. For example,
the StrongARM SA-2 processor, currently being
designed by Intel, is estimated to dissipate 500mW at
600MHz, but only 40mW when running at 150MHz,
a 12-fold power reduction for a 4-fold performance reduction
[1]. Likewise, the Pentium-III processor with
SpeedStep technology dissipates 9W at 500MHz but
22W at 650MHz [2]. AMD has added clock and voltage
scaling to the AMD Mobile K6 Plus processor family
and Transmeta has also developed processors with voltage
scaling. Because of this tradeoff in speed vs. power,
the decision of when to change the frequency or the volt- 
age and frequency of such processors must be made ju- 
diciously while taking into account application demand 
and quality of user experience. 


We believe that the decision to change processor speed 
and voltage must be controlled by the operating system. 
The operating system or similar system software is the 
only entity with a global view of resource usage and de- 
mand. Although it is clear that the operating system 
should control the scheduling mechanism, it is not clear 
what inputs are necessary to formulate the scheduling
policy. There are two possible sources of information for
policies. The application can estimate activity, providing 
information to the operating system about computation 
rates or deadlines, or the operating system can attempt to
infer some policy for the applications from their behav- 
ior. These can be used separately or in concert to control 
voltage and processor speed. 


A number of studies have investigated policies to auto- 
matically infer computation demands and adjust the pro- 
cessor accordingly. We have implemented those previ- 
ously described algorithms; this paper describes our ex- 
perience. 


In the next section, we present some background mate- 
rial. We discuss related work in Section 3. In Section 4 
we describe the schedulers we examine, our workload 
and our measurement methodology. We then discuss our 
results in Section 5. 


2 Background 


To better understand the importance of voltage and clock
scheduling, we begin by reviewing energy-consumption 
concepts, then present an overview of scheduling algo- 
rithms. Lastly, we give an overview of our test platform, 
the Itsy Pocket Computer. 


2.1 Energy 


The energy E, measured in Joules (J), consumed by 
a computer over T seconds is equal to the integral of 
the instantaneous power, measured in Watts (W). The 
instantaneous power consumed by components imple- 
mented in CMOS, such as microprocessors and DRAM, 
is proportional to $V^2 \times F$, where $V$ is the voltage supplying
the component, and $F$ is the frequency of the clock
driving the component. Thus, the power consumed by
a computer to, say, search an electronic phone book,
may be reduced by reducing $V$, $F$, or both. However,
for tasks that require a fixed amount of work, reducing 
the frequency may result in the system taking more time 
to complete the work. Thus, little or no energy will be 
saved. There are techniques that can result in energy sav- 
ings when the processor is idle, typically through clock 
gating, which avoids powering unused devices. 


In normal usage pocket computers run on batteries, 
which contain a limited supply of energy. However, as
discussed in [3], in practice, the amount of energy a battery
can deliver (i.e., its capacity) is reduced with increased
power consumption. As an illustration of this
effect, consider the Itsy pocket computer that was used 
in this study (described in Section 2.3). When the sys- 
tem is idle, the integrated power manager disables the 
processor core but the devices remain active. If the sys- 
tem clock is 206 MHz, a typical pair of alkaline batteries 
will power the system for about 2 hours; if the system 
clock is set to 59 MHz, those same batteries will last for 
about 18 hours. Although the battery lifetime increased 
by a factor of 9, the processor speed was only decreased 
by a factor of 3.5. The capacity of the battery can also 
be increased by interspersing periods of high power demand
with much longer periods of low power demand
resulting in a “pulsed power” system [4]. The extent to
which these two non-ideal properties can be exploited 
is highly dependent on the chemical properties and the 
construction of a battery as well as the conditions un- 
der which the battery is used. In general, the former ef- 
fect (minimizing peak demand) is more important than 
the latter for the domain of pocket computers because 
pulsed power systems need a significant period of time 
to recharge the battery, and most computer applications 
place a more constant demand on the battery. 


If a system allows the voltage to be reduced when clock 
speed is reduced (i.e. it supports voltage scaling), it 
is better to reduce the clock speed to the minimum 
needed rather than running at peak speed and then being 
idle. For example, consider a computation that normally 
takes 600 million instructions to complete. That appli- 
cation would take one second on a StrongARM SA-2 at 
600MHz and would consume 500 mJoules. At 150MHz,
the application would take four seconds to complete, 
but would only consume 160 mJoules, roughly a three-fold savings
assuming that an idle computer consumes no energy.
There is obviously a significant benefit to running
slower when the application can tolerate additional de- 
lay. Pering [5] used the term voltage scheduling to mean 
scheduling policies that seek to adjust both clock speed 
and energy. The goal of voltage scheduling is to reduce 
the clock speed such that all work on the processor can 
be completed “on time” and then reduce the voltage to 
the minimum needed to insure stability at that frequency. 
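
The arithmetic behind the SA-2 example can be checked with a few lines of Python. This is only a worked restatement of the figures quoted above (500 mW at 600 MHz, 40 mW at 150 MHz, 600 million instructions); the one-instruction-per-cycle rate is the assumption implied by the example, and the helper name `energy_mj` is ours.

```python
def energy_mj(power_mw: float, seconds: float) -> float:
    """Energy in millijoules: power (mW) integrated over a constant-power run."""
    return power_mw * seconds


INSTRUCTIONS = 600e6  # work to be done, as in the example above

# Assumption: one instruction per clock cycle, so run time = work / frequency.
fast_time_s = INSTRUCTIONS / 600e6   # 1 second at 600 MHz
slow_time_s = INSTRUCTIONS / 150e6   # 4 seconds at 150 MHz

fast_energy = energy_mj(500.0, fast_time_s)  # 500 mW for 1 s
slow_energy = energy_mj(40.0, slow_time_s)   # 40 mW for 4 s
savings = fast_energy / slow_energy

print(fast_energy, slow_energy, savings)
```

The comparison ignores the energy the faster configuration spends idling for the remaining three seconds, which is the same simplification the text makes.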


2.2 Clock Scheduling Algorithms 


In scheduling the voltage at which a system operates and 
the frequency at which it runs, a scheduler faces two 
tasks: to predict what the future system load will be 
(given past behavior) and to scale the voltage and clock
frequency accordingly. These two tasks are referred to as
prediction and speed-setting [6]. We consider one sched- 
uler better than another if it meets the same deadlines (or 
has the same behavior) as another policy but reduces the 
clock speed for longer periods of time. 


The schedulers we implemented are interval schedulers, 
so called because the prediction and scaling tasks are 
performed at fixed intervals as the system runs [7]. At 
each interval, the processor utilization for the interval is 
predicted, using the utilization of the processor over one 
or more preceding intervals. We consider two prediction
algorithms originally proposed by Weiser et al. [7]:
PAST and AVG$_N$. Under PAST, the current interval is
predicted to be as busy as the immediately preceding interval,
while under AVG$_N$ an exponential moving average
with decay $N$ of the previous intervals is used. That is,
at each interval, we compute a “weighted utilization” at
time $t$, $W_t$, as a function of the utilization of the previous
interval $U_{t-1}$ and the previous weighted utilization
$W_{t-1}$. The AVG$_N$ policy sets
$W_t = \frac{N \times W_{t-1} + U_{t-1}}{N + 1}$. The
PAST policy is simply the AVG$_0$ policy, and assumes the
current interval will have the same resource demands as
the previous interval.
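
A minimal sketch of the weighted-utilization update, assuming the standard AVG$_N$ form $W_t = (N \times W_{t-1} + U_{t-1}) / (N + 1)$, which reduces to PAST when $N = 0$. The function names and the initial weighted utilization of zero are illustrative choices, not taken from the implementation described here.

```python
from typing import List


def avg_n_update(prev_weighted: float, prev_util: float, n: int) -> float:
    """One AVG_N step: blend the last weighted value with the last utilization."""
    return (n * prev_weighted + prev_util) / (n + 1)


def predict_series(utilizations: List[float], n: int, w0: float = 0.0) -> List[float]:
    """Predicted utilization for the interval following each observed one."""
    predictions = []
    weighted = w0
    for util in utilizations:
        weighted = avg_n_update(weighted, util, n)
        predictions.append(weighted)
    return predictions


# A hypothetical, strongly alternating load: busy, idle, busy, idle.
observed = [1.0, 0.0, 1.0, 0.0]
past = predict_series(observed, n=0)   # PAST: repeats the previous interval
avg3 = predict_series(observed, n=3)   # AVG_3: smooths over earlier intervals
print(past)
print(avg3)
```

With $N = 0$ the prediction simply echoes the last interval, while larger $N$ reacts more slowly to each change in load.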


The decision of whether to scale the clock and/or voltage 
is determined by a pair of boundary values used to provide
hysteresis to the scheduling policy. If the utilization
drops below the lower value, the clock is scaled down; 
similarly, if the utilization rises above the higher value, 
the clock is scaled up. Pering et al. [8] set these values at 
50% and 70%. We used those values as a starting point 
but, as we discuss in Section 5.3, we found that the spe- 
cific values are very sensitive to application behavior. 


Deciding how much to scale the processor clock is sep- 
arate from the decision of when to scale the clock up 
(or down). The SA-1100 processor used in the Itsy supports
11 different clock rates or “clock steps”. Thus, our
algorithms must select one of the discrete clock steps. 
We use three algorithms for scaling: one, double, and 
peg. The one policy increments (or decrements) the 
clock value by one step. The peg policy sets the clock 
to the highest (or lowest) value. The double policy
tries to double (or halve) the clock step. Since the low- 
est clock step on the Itsy is zero, we increment the clock 
index value before doubling it. Separate policies may be 
used for scaling upwards and downwards. 
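
The hysteresis test and the three speed-setting policies can be sketched together as below. This is an illustration under stated assumptions rather than the Itsy kernel code: clock steps are indexed 0 to 10, results are clamped to that range, doubling is taken as `(step + 1) * 2`, and halving as integer division by two (the text does not spell out the halving rule). The 50% and 70% defaults are the boundary values quoted above.

```python
MIN_STEP = 0
MAX_STEP = 10  # 11 clock steps, indexed 0..10 (assumed indexing)


def scale(step: int, direction: int, policy: str) -> int:
    """Apply one speed-setting policy; direction is +1 (up) or -1 (down)."""
    if policy == "one":
        new_step = step + direction
    elif policy == "peg":
        new_step = MAX_STEP if direction > 0 else MIN_STEP
    elif policy == "double":
        # The lowest step is zero, so increment before doubling (assumed form).
        new_step = (step + 1) * 2 if direction > 0 else step // 2
    else:
        raise ValueError(f"unknown policy: {policy}")
    return max(MIN_STEP, min(MAX_STEP, new_step))


def next_step(step: int, utilization: float, up_policy: str, down_policy: str,
              low: float = 0.50, high: float = 0.70) -> int:
    """Hysteresis: change the clock only when utilization leaves [low, high]."""
    if utilization > high:
        return scale(step, +1, up_policy)
    if utilization < low:
        return scale(step, -1, down_policy)
    return step


print(next_step(0, 0.90, "double", "one"))  # busy at the slowest step
print(next_step(5, 0.60, "peg", "peg"))     # inside the band: unchanged
print(next_step(5, 0.20, "one", "peg"))     # mostly idle: peg to the lowest step
```

Passing separate `up_policy` and `down_policy` arguments mirrors the option of using different policies for scaling upwards and downwards.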





Figure 1: Equipment setups used to measure power. 


2.3 The Itsy Pocket Computer 


The Itsy Pocket Computer is a flexible research plat- 
form, developed to enable hardware and software re- 
search in pocket computing. It is a small, low-power, 
high-performance handheld device with a highly flexible 
interface, designed to encourage the development of in- 
novative research projects, such as novel user interfaces, 
new applications, power management techniques, and 
hardware extensions. There are several versions of the 
basic Itsy design, with varying amount of RAM, flash 
memory and I/O devices. We used several units for this 
study that were modified by Compaq Computer Corporation’s
Western Research Lab to include instrumentation
leads for power measurement. Figure 1 shows the
units along with the measurement equipment we used. 
We investigate the energy and power consumption of 
the Itsy Pocket Computer when it is run at between 
59 MHz and 206 MHz, and when its StrongARM SA- 
1100 [9, 10] processor is powered at two different volt- 
age levels. 


All versions of the Itsy are based on the low-power 
StrongARM SA-1100 microprocessor. All versions 
have a small, high-resolution display, which offers 320 x 
200 pixels on a 0.18mm pixel pitch, and 15 levels of 
greyscale. All versions also include a touchscreen, a mi- 
crophone, a speaker, and serial and IrDA communica- 
tion ports. The Itsy architecture can support up to 128 
Mbytes both of DRAM and flash memory. The flash 
memory provides persistent storage for the operating 
system, the root file system, and other file systems and 
data. Finally, the Itsy also provides a “daughter card” 
interface that allows the base hardware to be easily extended.
The Itsy uses two voltage supplies powered by
the same power source. The processor core is driven by a
1.5 V supply while the peripherals are driven by a 3.3 V
supply. Both power supplies are driven by a single 3.1V
supply connected to the electrical mains.


Figure 2: Itsy System Architecture


The Itsy version 1.5 units used as the basis for this work 
have 64 Mbytes of DRAM and 32 Mbytes of flash mem- 
ory. These units were modified to allow us to run the 
StrongARM SA-1100 at either 1.5 V or 1.23 V. Al- 
though 1.23V is below the manufacturer’s specifica- 
tion, it can be safely used at moderate clock speeds and 
our measurements indicate the voltage reduction yields 
about a 15% reduction in the power consumed by the 
processor; the percentage of power reduction for the system
may be less than this (depending on workload) because
voltage scaling only reduces the power used by the
processor. The Itsy can be powered either by an external 
supply or by two size AAA batteries. Figure 2 shows a 
schematic of the Itsy architecture. 


The system software of the Itsy includes a monitor and 
a port of version 2.0.30 of the Linux operating sys- 
tem. The Linux system was configured to provide sup- 
port for networking, file systems and multi-user manage- 
ment. Applications can be developed using a number of 
programming environments, including C, X-Windows, 
SmallTalk and Java. Applications can also take advan- 
tage of available speech synthesis and speech recogni- 
tion libraries. 


3 Related Work 


We believe our evaluation of dynamic speed and
voltage setting algorithms to be the first such empirical
evaluation. To our knowledge, all previous work from
different groups has relied on simulators [7, 6, 5, 11, 12];
none modeled a complete pocket computer or the workload
likely to be run on it.


Weiser et al. [7] proposed three algorithms, OPT, 
FUTURE, and PAST and evaluated them using traces 
gathered from UNIX-based workstations running engi- 
neering applications. These algorithms use an interval- 
based approach that determines the clock frequency for 
each interval. Of the algorithms they propose, only 
PAST is feasible because it does not make decisions us- 
ing future information that would not be available to an 
actual implementation. Even so, the actual version of 
PAST proposed by Weiser et al. is not implementable
because it requires that the scheduler know the amount 
of work that had to be performed in the preceding in- 
tervals. This information was used by the scheduler to 
choose a clock speed that allows this delayed work to be 
completed in the next interval, if possible. For example, 
suppose post-processing of a trace revealed that the pro- 
cessor was busy 80% of the cycles while running at full 
speed. If, during re-play of the trace, the scheduler opted 
to run the processor at 50% speed for the interval, then 
30% of the work could not be completed in that interval. 
Consequently, in the next interval, the scheduler would 
adjust the speed in an effort to at least complete the 30% 
“unfinished” work. Without additional information from 
the application, the scheduler can simply observe that 
the application executed until the end of the scheduling 
quantum, and does not know the amount of “unfinished”
computing left. Because most pocket computer applica- 
tions do not provide a means for the processor to know 
how much work should be done in a given interval, the 
PAST algorithm is not tractable for such systems. 
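
The 80%/50% example above amounts to simple bookkeeping during trace replay, sketched below. The function name `replay_interval` and the convention of expressing work and speed as fractions of a full-speed interval are assumptions made for illustration; the point is that the `work_full_speed` input comes from post-processing a trace and is exactly what a live scheduler cannot observe.

```python
from typing import Tuple


def replay_interval(work_full_speed: float, speed: float,
                    carried_over: float = 0.0) -> Tuple[float, float]:
    """Replay one interval of a trace at a reduced clock speed.

    work_full_speed: fraction of the interval the CPU was busy at full speed.
    speed: chosen clock speed as a fraction of full speed.
    carried_over: unfinished work inherited from earlier intervals.
    Returns (work completed, work left unfinished).
    """
    demand = work_full_speed + carried_over
    completed = min(demand, speed)
    return completed, demand - completed


# The example from the text: 80% busy at full speed, replayed at 50% speed.
done, unfinished = replay_interval(0.8, 0.5)
print(done, unfinished)

# In the next interval the scheduler must cover new work plus the backlog.
done_next, unfinished_next = replay_interval(0.2, 0.5, carried_over=unfinished)
print(done_next, unfinished_next)
```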


The early work of Weiser et al. has been extended by 
several groups, including [6, 12]. Both of these groups 
employed the same assumptions and the same traces 
used by Weiser. Govil et al. [6] considered a large num- 
ber of algorithms, while Martin [12] revised Weiser’s 
PAST algorithm to account for the non-ideal properties 
of batteries and the non-linear relationship between sys- 
tem power and clock frequency. Martin argues that the 
lower bound on clock frequency should be chosen such 
that the number of computations per battery lifetime is 
maximized. While Martin correctly assumed a non-zero 
energy cost for idling the processor and changing clock 
speed, neither Govil nor Weiser did. 


Both our work and that of Pering et al. [5, 11] address
some of the limitations of the earlier work noted above.
In particular, we both evaluate implementable
algorithms using workloads that are representative of
those that might be run on pocket computers. We assess
the success of our algorithms under the assumption
that our applications have inelastic performance constraints
and that the user should see no visible changes
induced by the scheduling algorithms. By comparison,
Pering et al. assume that frames of an MPEG video,
for instance, can be dropped and present results that
combine energy savings and frame
rates. Our goal was to understand the performance of
the different scheduling algorithms without introducing
the complexity of comparing multi-dimensional performance
metrics such as the percentage of dropped frames
vs. power savings.


Pering et al. use intervals of 10-50ms for their schedul- 
ing calculations. In comparison to the earlier approaches 
presented in [7, 6, 12] in which work was considered 
overdue if it was not completed within an interval, both 
Pering et al. and our study consider an event to have 
occurred on time if delaying its completion did not adversely
affect the user. However, a number of important
differences exist between our work and that of Pering et
al. First, Pering et al. model only the power consumed
by the microprocessor and the memory, thus ignoring 
other system components whose power is not reduced 
by changes in clock frequency. Second, by virtue of 
our work using an actual implementation, we are able 
to evaluate longer running applications and more com- 
plex applications (e.g., Java). By virtue of their size, 
our applications exhibit more significant memory behav- 
ior, and thus, expose the non-linear relationship between 
power and clock speed noted by Martin. Lastly, by us- 
ing an actual system, our scheduling implementations 
were exposed to periodic behaviors that are captured 
by traces; for example, the Java implementation uses a 
30ms polling loop to check for I/O events. This periodic 
polling adds additional variation to the clock setting al- 
gorithms, inducing the sort of instability we will explain 
in Section 5.3.


4 Methodology 


Before describing the implementation of the clock and 
voltage scheduling algorithms we used, it is important to 
understand how we did our measurements. Section 4.1 
describes how we measure power and energy. We then 
describe the implementation of the schedulers and the 
workloads we used to assess their performance. 


4.1 Measuring Power and Total Energy 


To measure the instantaneous power consumed by the 
Itsy, we use a data acquisition (DAQ) system to record 
the current drawn by the Itsy as it is connected to an ex- 
ternal voltage supply, and the voltage provided by this 
supply. Figure 1 presents a picture of our setup along 
with the wires connected to the Itsy to facilitate measuring
the supply current¹ and voltage. We configured
the DAQ system to read the voltage 5000 times per sec- 
ond, and convert these readings to 16-bit values. These 
values were then forwarded to a host computer, which 
stored them for subsequent analysis. From these mea- 
surements, we can compute a time profile of the power 
used by an application as it runs on the Itsy. 


To determine the relevant part of the power-usage pro- 
file of a workload, we measure the time required to ex- 
ecute the workload and then select the relevant set of 
measurements from the data collected by the DAQ system.
For each benchmark, we used the gettimeofday
system call to time its execution; this interface uses
the 3.6 MHz clock available on the processor to provide 
accurate timing information. To synchronize the collec- 
tion of the voltages with the start of execution of a work- 
load, as the workload begins executing, we toggle one 
of the SA1100’s general-purpose input-output (GPIO) 
pins. This pin is connected to the external trigger of the 
DAQ system; toggling the GPIO causes the DAQ system 
to begin recording measurements. As our measurement 
technique is very similar to that which we used in [13], 
we refer the reader to this reference for a more in-depth 
description. 


Once the relevant part of the profile has been determined,
we use it to calculate the average power and
the total energy consumed by the Itsy during the
corresponding time interval. To compute the energy, we
assume that the power measured at time t represents
the average power of the Itsy for the interval
t to t + 0.0002 seconds, where 0.0002 seconds is
the time between successive power measurements.
Thus, the energy E is equal to
$\sum_{i=1}^{n} p_i(t) \times 0.0002$,
where $p_1(t), \ldots, p_n(t)$ are the n power readings
of interest.
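
The energy calculation above amounts to a rectangle-rule sum over the DAQ samples. The following is a minimal Python sketch, assuming the power samples have already been derived from the recorded readings; the function names and any example numbers are illustrative and are not taken from the measurement setup.

```python
SAMPLE_PERIOD_S = 0.0002  # 5000 readings per second, as stated in the text


def power_from_shunt(supply_volts, shunt_drop_volts, shunt_ohms):
    """Instantaneous power from the supply voltage and the voltage
    drop across a series sense resistor (hypothetical helper)."""
    current_amps = shunt_drop_volts / shunt_ohms
    return supply_volts * current_amps


def total_energy(power_samples, period_s=SAMPLE_PERIOD_S):
    """Treat each reading as the average power over the following
    sample period and sum the per-interval energies (joules)."""
    return sum(p * period_s for p in power_samples)


def average_power(power_samples):
    """Mean of the selected power readings (watts)."""
    return sum(power_samples) / len(power_samples)
```

With a constant 1 W reading for one second (5000 samples), `total_energy` returns approximately 1 J, which is a quick sanity check on the sample period.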


In making our power measurements, we used an approach
similar to the one used in [13] to reduce a number
of sources of possible measurement error. We measured
multiple runs of each workload; in general, we
found the 95% confidence interval of the energy to be
less than 0.7% of the mean energy. This implies that the
runs were very repeatable, despite the possible variation
that would arise from interactions between application
threads, other processes and system daemons.

¹ The supply current was measured by measuring the voltage drop
across a high-precision, small-valued resistor of a known resistance
(0.02Ω). The current was then calculated by dividing the voltage by
the resistance.


4.2 Workload 


We used a varied workload to assess the performance 
of the different clock scaling algorithms. Since it’s not 
clear what applications will be common on pocket com- 
puters, we used some obvious applications (web brows- 
ing, text reading) and other less obvious applications 
(chess, mpeg video and audio). The applications ran 
either directly on top of the Linux operating system or 
within a Java virtual machine [14]. To capture repeat- 
able behavior for the interactive applications, we used 
a tracing mechanism that recorded timestamped input 
events and then allowed us to replay those events with 
millisecond accuracy. We did not trace the mpeg play- 
back because there is no user interaction, and we found 
little inter-run variance. We used the following applica- 
tions: 


MPEG: We played a 320x200 color MPEG-1 video 
and audio clip at 15 frames a second. The mpeg 
video was rendered as a greyscale image on the 
Itsy. Audio was rendered by sending the au- 
dio stream as a WAV file to an audio player 
which ran as a separate process, forked from the 
video player. There is no explicit synchroniza- 
tion between the audio and video sequences, but 
both are sequenced to remain synchronized at 15 
frames/second. The clip is 14 seconds and was 
played in a loop to provide 60 seconds of play- 
back. 


Web: We used a Javabean version of the IceWeb
browser to view content stored on the Itsy.
We selected a file containing a stored article
from www.news.com concerning the Itsy. We
scrolled down the page, reading the full article.
We then went back to the root menu and opened a
file containing an HTML version of WRL technical
report TN-56, which has many tables describing
characteristics of power usage in Itsy components.
The overall trace was 190 seconds of activity.


Chess: We used a Java interface to version 16.10 of the
Crafty chess playing program. Crafty was run as
a separate process. Crafty uses a play book for
opening moves and then plays for specific periods
of time in later stages of the games and plays the
best move available when time expires. The 218
second trace includes a complete game of Crafty
playing against a novice player (who lost, badly).


TalkingEditor: We used a version of the “mpedit” Java 
text editor that had been modified to read text files 
aloud using the DECtalk speech synthesis system 
(which is run in a separate process). The input 
trace records the user selecting a file to be opened 
using the file dialogue, (i.e. moving to the direc- 
tory of the short text file and selecting the file), 
then having it spoken aloud and finally opening 
and having another text file read aloud. The trace 
took 70 seconds. 


The Kaffe Java system [14] uses a JIT, makes extensive 
use of dynamic shared libraries and supports a threading 
model using setjmp/longjmp. The graphics library used 
by Java is a modified version of the publicly available
GRX graphics library and uses a polling I/O model to
check for new input every 30 milliseconds. The MPEG
player renders directly to the display. 


4.3 Implementing the Scheduling Algorithms 


We made two modifications to the Linux kernel to sup- 
port our clock scheduling algorithms and data record- 
ing. The first modification provides a log of the process 
scheduler activity. This component is implemented as 
a kernel module with small code modifications to the 
scheduler that allow the logging to be turned on and 
off. For each scheduling decision, we record the pro- 
cess identifier of the process being scheduled, the time 
at which it was scheduled (with microsecond resolution) 
and the current clock rate. 


We also implemented an extensible clock scaling policy 
module as a kernel module. We modified the clock in- 
terrupt handler to call the clock scheduling mechanism 
if it has been installed, and the Linux scheduler to keep 
track of CPU utilization. In Linux, the idle process al- 
ways uses the zero process identifier. The idle process 
enters a low-power “nap” mode that stalls the processor 
pipeline until the next scheduling interval. If the previ- 
ous process was not the idle process, the kernel adds the 
execution time to a running total. On every clock interrupt,
this total is examined by the clock scaling module
and then cleared. The CPU utilization can be calculated 
by comparing the time spent non-idle to the time length 
of a quantum. Our time quantum was set to 10 msec, the
default scheduling period in Linux; Pering et al. [5, 11] 
used similar values for their calculations. 
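
The utilization bookkeeping described above can be sketched as follows. This is a simplified user-level model in Python, not the kernel code: the class and method names are hypothetical, and it assumes the scheduler reports how long the outgoing process ran.

```python
QUANTUM_US = 10_000  # 10 msec quantum, as in the text
IDLE_PID = 0         # the Linux idle process uses process identifier zero


class UtilizationTracker:
    """Accumulates non-idle execution time between clock interrupts."""

    def __init__(self):
        self.busy_us = 0

    def on_context_switch(self, prev_pid, ran_us):
        # Only time spent outside the idle process counts as busy.
        if prev_pid != IDLE_PID:
            self.busy_us += ran_us

    def on_clock_interrupt(self):
        # The clock scaling module examines the running total and
        # then clears it for the next quantum.
        utilization = min(self.busy_us / QUANTUM_US, 1.0)
        self.busy_us = 0
        return utilization
```

A quantum in which a process ran for 3 ms and the idle process ran for the remainder would report a utilization of 0.3 under this model.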


Normally, a process can run for several quanta before the 
scheduler is called. The executing process is interrupted 
by the 100Hz system clock when the O/S decrements 
and examines a counter in the process control block at 
each interrupt. When that counter is zero, the scheduler 
is called. We set the counter to one each time we sched- 
ule a process, forcing the scheduler to be called every 
10ms. While this modification adds overhead to the exe- 
cution of an application, it allows us to control the clock 
scaling more rapidly. We measured the execution over- 
head and found it to be very small (about 6 microseconds 
for each 10ms interval, or 0.06%). 


5 Results 


The purpose of our study is to determine if the heuris- 
tics developed in prior studies can be practically applied 
to actual pocket computers. We examined a number of 
policies, most of which are variants of the AVG_N policy.
As described in §4.3, we used three different speed-setting
policies. Our intent was to focus on systems that
could be implemented in an actual O/S and that did not 
require modifications to the applications (such as requir- 
ing information about deadlines or schedules). We as- 
sumed that our workloads had inelastic constraints; in 
other words, we assumed the applications had no way to 
accommodate “missed deadlines”. 


We split the discussion of our results into three parts. 
The first section describes aspects of the applications 
and how they differ from those used in prior work and 
the second section discusses the performance of the dif- 
ferent clock scheduling algorithms. Finally, we examine 
the benefit of the limited voltage scaling available on the 
Itsy and summarize the results. 


5.1 Application Characteristics 


Figure 3 presents plots of the processor utilization over 
time for each of the benchmark applications. This infor- 
mation was gathered using the on-line process logging 
facility that we added to the kernel. Due to kernel mem- 
ory limitations, we could only capture a subset of the 
process behavior. Each application was able to run at
132MHz and still meet any user interaction constraints
(i.e., the application did not appear to behave any
differently).

Figure 3: Utilization Using a 10ms Moving Average Over
30 to 40 Second Intervals at the 206MHz Frequency Setting


The utilization is computed for each 10ms scheduling 
quantum. We used the same 10ms interval for logging 
that is used for scheduling within Linux. Since most pro- 
cesses compute for several quanta before yielding, the 
system is usually either completely idle or completely 
busy during a given quantum. Some processes execute 
for only a short time then yield the processor prior to 
the end of their scheduling quanta; for example, the Java 
implementation we used has a 30ms I/O polling loop — 
thus, when the Java system is “idle,” there is a constant 
polling action every 30ms that takes about a millisecond 
to complete. 


The behavior of the applications is difficult to predict,
even for applications that should have very predictable
behavior, and each application appears to run at a
different time-scale. The MPEG application renders at 15
frames/sec; there are 450 frames in the 30 second
interval shown in Figure 3. Each frame is rendered in
67ms or just under 7 scheduling quanta. Any scheduling
mechanism attempting to use information from a single
frame (as opposed to a single quantum) would need to
examine at least 7 quanta. Other applications have much
coarser behavior. For example, the TalkingEditor
application consumes varying amounts of CPU time until the
text is being loaded for speech synthesis. The bursty
behavior prior to the speech synthesis results from
dragging images, JIT’ing applications and opening files.
Following this are long bursts of computation as the text
is actually synthesized and sent to the OSS-compatible
sound driver. Finally, more cycles are taken by the sound
driver. Thus, this application is bursty at a higher level.


For most applications, patterns in the utilization are eas- 
ier to see if you plot the utilization using a 100ms mov- 
ing average, as shown in Figure 4. The MPEG appli- 
cation, in Figure 4(a), is still very sporadic because of 
inter-frame variation; for MPEG, there is even significant
variance in CPU utilization (60-80%) when considering
a 1 second moving average (not shown). The
Chess and TalkingEditor applications show patterns in- 
fluenced by user interaction. It’s clear from Figure 4(c) 
that utilization is low when the user is thinking or mak- 
ing a move and that utilization reaches 100% when 
Crafty is planning moves. Likewise, Figure 4(d) shows 
the aforementioned pattern of synthesis and sound ren- 
dering. 

Figure 4: Utilization Using a 100ms Moving Average Over
30 to 40 Second Intervals at the 206MHz Frequency Setting


5.2 Clock Scheduling Comparison


The goal of a clock scheduling algorithm is to try to 
predict or recognize a CPU usage pattern and then set 
the CPU clock speed sufficiently high to meet the (pre- 
dicted) needs of that application. Although patterns in 
the utilization are more evident when using a 100ms 
sliding average for utilization, we found that averaging 
over such a long period of time caused us to miss our 
“deadline”. In other words, the MPEG audio and video 
became unsynchronized and some other applications
such as the speech synthesis engine had noticeable de- 
lays. This occurs because it takes longer for the system 
to realize it is becoming busy. 


This delay is the reason that the studies of Govil et al. [6] 
and Weiser [7] argued that clock adjustment should ex- 
amine a 10-50ms interval when predicting future speed 
settings. However, as Figure 3 shows, it is difficult to 
find any discernible pattern at the smaller time-scales. 
Like Govil et al., we also allowed speed setting to occur 
at any interval; Weiser et al. did not model having the 
scheduler interrupted while an application was running,
but rather deferred clock speed changes to occur only
when a process yielded or began executing in a quantum.


There are a number of possible speed-setting heuristics
we could examine; since we were focusing on
implementable policies, we primarily used the policies
explored by Pering et al. [5]. We also explored other
alternatives. One simple policy would determine the number
of “busy” instructions during the previous N 10ms
scheduling quanta and predict that activity in the next
quantum would have the same percentage of busy cycles.
The clock speed would then be set to ensure enough busy
cycles.


This policy sounds simple, but it results in exceptionally
poor responsiveness, as illustrated in Figure 5.
Figure 5(a) shows the speed changes that would occur when
the application is moving from a period of high CPU
utilization to one of low utilization; the speed changes to
59MHz relatively quickly because we are adding in a
large number of idle cycles each quantum. By comparison,
when the application moves from an idle period
to a fully utilized period, the simple speed setting pol- 
icy makes very slow changes to the processor utilization 
and thus the processor speed increases very slowly. This 
occurs because the total number of non-idle instructions 
across the four scheduling intervals grows very slowly. 

Figure 5: Simple averaging behavior results in poor
policies. Each box represents a single scheduling interval,
and the scheduling policy averages the number of
non-idle instructions over the four scheduling quanta to select
the minimum processor speed. To simplify the example,
we assume each interval is either fully utilized or idle.
The notation “206/0” means the CPU is set to 206MHz
and the quantum is idle while “206/1” means the CPU is
fully utilized.
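
The asymmetry of the simple averaging policy can be reproduced with a short simulation. The Python sketch below is one strict reading of the policy, not the authors' implementation: it assumes a window of four quanta, measures busy work in clock-rate units (kHz, as a stand-in for cycles per quantum), and uses the Itsy clock steps listed in Table 3. The function names are hypothetical.

```python
# Itsy clock steps in kHz (the frequencies listed in Table 3).
SPEEDS_KHZ = [59000, 73700, 88500, 103200, 118000, 132700,
              147500, 162200, 176900, 191700, 206400]


def next_speed(history):
    """Pick the slowest clock whose cycles per quantum cover the
    average busy work seen over the quanta in `history`."""
    total = sum(history)
    for speed in SPEEDS_KHZ:
        if speed * len(history) >= total:
            return speed
    return SPEEDS_KHZ[-1]


def simulate(history, speed, busy_flags):
    """Feed fully busy (True) or fully idle (False) quanta and
    return the clock chosen after each one."""
    history = list(history)
    chosen = []
    for busy in busy_flags:
        # A busy quantum contributes one quantum of work at the
        # current clock; an idle quantum contributes none.
        history = history[1:] + [speed if busy else 0]
        speed = next_speed(history)
        chosen.append(speed)
    return chosen


down = simulate([206400] * 4, 206400, [False] * 4)
up = simulate([0] * 4, 59000, [True] * 4)
print(down)  # [162200, 103200, 59000, 59000]
print(up)    # [59000, 59000, 59000, 59000]
```

Under these assumptions the clock falls to 59MHz within three idle quanta, while four saturated quanta at 59MHz leave the clock unchanged, because a saturated quantum at the lowest speed can never report more work than that speed provides. This is an extreme form of the slow ramp-up described in the text; a policy that targets less than full utilization would climb, but only gradually.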


5.3 The AVG_N Scheduler


We had initially thought that a policy targeting the neces- 
sary number of non-idle cycles would result in good be- 
havior, but the previous example highlights why we use 
the speed-setting policies described in §4.3. We used the
same AVG_N scheduler proposed by Govil [6] and
Pering [5] and also examined by Pering et al. in [5];
Pering’s later paper [11] did not examine scheduler
heuristics and only used real-time scheduling with
application-specified scheduling goals.


Our findings indicate that the AVG_N algorithm cannot
settle on the clock speed that maximizes CPU utilization.
Although a given set of parameters can result in optimal
performance for a single application, these tuned
parameters will probably not work for other applications, or
even the same application with different input. The
variance inherent in many deadline-based applications
prevents an accurate assessment of the computational needs
of an application. The AVG_N policy can be easily
designed to ensure that very few deadlines will be missed,
but this results in minimal energy savings. We use an
MPEG player as a running example in this section, as
it best exemplifies behavior that illustrates the multitude
of problems in past-based interval algorithms. Our
intuition is that if there is a single application that
illustrates simple, easy-to-predict behavior, it should be
MPEG. Our measurements showed that the MPEG
application can run at 132MHz without dropping frames
and still maintain synchronization between the audio and
video. An ideal clock scheduling policy would therefore
target a speed of 132MHz.


However, without information from the user level appli- 
cation, a kernel cannot accurately determine what dead- 
lines an application operates under. First, an application 
may have different deadline requirements depending on 
its input; for example, an MPEG player displaying a 
movie at 30fps has a shorter deadline than one running 
at 15fps. Although the deadlines for an application with 
a given input may be regular, the computation required 
in each deadline interval can vary widely. Again, MPEG 
players demonstrate this behavior; I-frames (key or ref- 
erence) require much more computation than P-frames 
(predicted), and do not necessarily occur at predictable 
intervals. 


One method of dealing with this variance is to look at 
lengthy intervals which will, by averaging, reduce the 
variance of the computational observations. Our uti- 
lization plots showed that even using 100ms intervals, 
significant variance is exhibited. In addition to interval
length, the number of intervals over which we average
(the N of the AVG_N policy) can also be manipulated. We
conducted a comprehensive study and varied the value 
of N from 0 (the PAST policy) to 10 with each com- 
bination of the speed-setting policies (i.e. using “peg” 
to set the CPU speed to the highest point, or “one” to 
increment or decrement the speed). 


Our conclusion from the results with our benchmarks
is that the weighted average has undesirable behavior.
The number of intervals not only represents the length
of interval to be considered; it also represents the lag
before the system responds, much like the simple averaging
example described above. Unlike that simple policy,
once AVG_N starts responding, it will do so quickly. For
example, consider a system using an AVG_9 mechanism
with an upper boundary of 70% utilization and “one” as
the algorithm used to increment or decrement the clock
speed. Starting from an idle state, the clock will not scale
to 206MHz for 120 ms (12 quanta). Once it scales up,
the system will continue to do so (as the average utiliza- 
tion will remain above 70%) unless the next quantum is 
partially idle. This occurs because the previous history is 
still considered with equal weight even when the system 
is running at a new clock value. 


The boundary conditions used by Pering in [5] result in
a system that scales more rapidly down than up. Table 1
illustrates how this occurs. If the weighted average is
70%, a fully active quantum will only increase the
average to 73% while a fully idle quantum will reduce it to
63%; thus, there is a tendency to reduce the processor
speed.

Sequence of quanta:  15 Active, followed by 5 Idle
Actions recorded:    Scale up (5 times); Scale down (on the final Idle quantum)

Table 1: Scheduling Actions for the AVG_9 Policy
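
The 73% and 63% figures, and the 12-quanta lag mentioned earlier, follow directly from the weighted-average recursion W_t = (N × W_{t-1} + U_{t-1}) / (N + 1) with N = 9. The Python sketch below checks the arithmetic; the function names are hypothetical, and the starting value W = 0 for an idle system is an assumption.

```python
def avg_n(weighted, utilization, n):
    """One step of the AVG_N recursion:
    W_t = (N * W_{t-1} + U_{t-1}) / (N + 1)."""
    return (n * weighted + utilization) / (n + 1)


def quanta_until_above(threshold, n):
    """Count fully busy quanta needed, starting from W = 0, before
    the weighted utilization first exceeds `threshold`."""
    weighted, quanta = 0.0, 0
    while weighted <= threshold:
        weighted = avg_n(weighted, 1.0, n)
        quanta += 1
    return quanta


print(round(avg_n(0.70, 1.0, 9), 2))  # 0.73
print(round(avg_n(0.70, 0.0, 9), 2))  # 0.63
print(quanta_until_above(0.70, 9))    # 12
```

A fully busy quantum moves the average up by 3 points from 70%, while a fully idle one moves it down by 7, which is the asymmetry the paragraph describes.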


The job of the scheduler is made even more difficult by 
applications that attempt to make their own scheduling 
decisions. For example, the default MPEG player in 
the Itsy software distribution uses a heuristic to decide 
whether it should sleep before computing the next frame. 
If the rendering of a frame completes and the time until 
that frame is needed is less than 12ms, the player en- 
ters a spin loop; if it is greater than 12ms, the player 
relinquishes the processor by sleeping. Therefore, if the 
player is well ahead of schedule, it will show significant 
idle times; once the clock is scaled close to the optimal 
value to complete the necessary work, the work seem- 
ingly increases. The kernel has no method of determin- 
ing that this is wasteful work. 


Furthermore, there is some mathematical justification
for our assertion that AVG_N fundamentally exhibits
undesirable behavior, and will not stabilize on an optimal
clock speed, even for simple and predictable workloads.
Our analysis only examines the “smoothing” portion of
AVG_N, not the clock setting policy. Nevertheless, it
works well enough to highlight the instability issues with
AVG_N by showing that, even if the system is started out
at the ideal clock speed, AVG_N smoothing will still result
in undesirable oscillation.

Figure 6: Fourier Transform of a Decaying Exponential


A processor workload over time may be treated as a
mathematical function, taking on a value of 1 when the
processor is busy, and 0 when idling. Borrowing
techniques from signal processing allows us to characterize
the effect of AVG_N on workloads in general as well as
specific instances. AVG_N filters its input using a
decaying exponential weighting function. For our
implementation, we used a recursive definition in terms of both
the previous actual ($U_{t-1}$) and weighted ($W_{t-1}$)
utilizations: $W_t = \frac{N \times W_{t-1} + U_{t-1}}{N+1}$.
For the analysis, however, it is useful to transform this
into a less computationally practical representation,
purely in terms of earlier unweighted utilizations. By
recursively expanding the $W_{t-1}$ term and performing a
bit of algebra, this representation emerges:
$W_t = \frac{1}{N+1} \sum_{k=0}^{t-1} \left(\frac{N}{N+1}\right)^{t-k-1} U_k$.
This equation explicitly shows the dependency of each
$W_t$ on all previous $U_k$, and makes it more evident that
the weighted output may also be expressed as the result
of discretely convolving a decaying exponential
function with the raw input. This allows us to examine
specific types of workloads by artificially generating a
representative workload and then numerically convolving
the weighting function with it. We can also get a
qualitative feel for the general effects AVG_N has by moving to
continuous space and looking at the Fourier transform of
a decaying exponential, since convolving two functions
in the time domain is equivalent to multiplying their
corresponding Fourier transforms.
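
The recursive expansion can be written out step by step. The derivation below is a sketch that assumes an empty history, $W_0 = 0$, which the text does not state explicitly.

```latex
\begin{align*}
W_t &= \frac{N}{N+1}\,W_{t-1} + \frac{1}{N+1}\,U_{t-1} \\
    &= \frac{N}{N+1}\left(\frac{N}{N+1}\,W_{t-2} + \frac{1}{N+1}\,U_{t-2}\right)
       + \frac{1}{N+1}\,U_{t-1} \\
    &\;\;\vdots \\
    &= \left(\frac{N}{N+1}\right)^{t} W_0
       + \frac{1}{N+1}\sum_{k=0}^{t-1}\left(\frac{N}{N+1}\right)^{t-k-1} U_k .
\end{align*}
```

With $W_0 = 0$ the first term vanishes, leaving a discrete convolution of the utilization history with the weights $\frac{1}{N+1}\left(\frac{N}{N+1}\right)^{j}$ for $j = 0, 1, 2, \ldots$, which decay geometrically with the age $j$ of the sample.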


Let's begin by examining the Fourier transform of a
decaying exponential: $x(t) = e^{-\alpha t} u(t)$, where
$u(t)$ is the unit step function, 0 for all t < 0 and 1
for t > 0. This captures the general shape of the AVG_N
weighting function, shown in Figure 6. Its Fourier
transform is $X(\omega) = \frac{1}{j\omega + \alpha}$. The
transform attenuates, but does not eliminate, higher
frequency elements. If the input signal oscillates, the
output will oscillate as well. As $\alpha$ gets
smaller the higher frequencies are attenuated to a greater
degree, but this corresponds to picking a larger value for
N in AVG_N and comes at the expense of greater lag in
response to changing processor load.

Figure 7: Result of AVG_3 Filtering on the Processor
Utilization for a Periodic Workload Over Time
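
For reference, the transform and its magnitude can be worked out directly. The correspondence between $\alpha$ and $N$ in the last line is an assumption made here for illustration: it matches the per-quantum decay factor $N/(N+1)$ of the discrete filter to the continuous decay over one quantum of length $T$.

```latex
\begin{align*}
X(\omega) &= \int_{0}^{\infty} e^{-\alpha t}\, e^{-j\omega t}\, dt
           = \frac{1}{\alpha + j\omega}, \qquad \alpha > 0, \\
|X(\omega)| &= \frac{1}{\sqrt{\alpha^{2} + \omega^{2}}}, \\
e^{-\alpha T} &= \frac{N}{N+1}
  \;\Longrightarrow\;
  \alpha = \frac{1}{T}\,\ln\!\left(1 + \frac{1}{N}\right).
\end{align*}
```

The magnitude is finite and non-zero for every $\omega$, so no frequency is removed entirely. Relative to the response at $\omega = 0$, the gain is $\alpha / \sqrt{\alpha^{2} + \omega^{2}}$, which shrinks as $\alpha$ decreases, and under the assumed correspondence a smaller $\alpha$ means a larger $N$.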


For a specific workload example, we’ll use a simple re- 
peating rectangle wave, busy for 9 cycles, and then idle 
for 1 cycle. This is an idealized version of our MPEG 
player running roughly at an optimal speed, i.e. just idle 
enough to indicate that the system isn’t saturated. Ide- 
ally, a policy should be stable when it has the system run- 
ning at an optimal speed. This implies that the weighted 
utilization should remain in a range that would prevent 
the processor speed from changing. However, as was 
foreshadowed by our initial qualitative discussion, this
is not the case. A rectangular wave has many high fre- 
quency components, and these result in a processor uti- 
lization as shown in Figure 7. This figure shows the os- 
cillation for this example, and shows that oscillation oc- 
curs over a surprisingly wide range of the processor uti- 
lization. As discussed earlier, our experimental results 
with the MPEG player on the Itsy also exhibit this os- 
cillation because that application exhibits the same step- 
function resource demands exhibited by our example. 
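
The oscillation can be reproduced numerically. The Python sketch below applies the AVG_N recursion with N = 3 to the rectangle wave described above (9 busy quanta followed by 1 idle quantum); the function name and the choice of 30 repetitions to reach steady state are assumptions made for illustration.

```python
def avg_n_trace(utilizations, n):
    """Apply W_t = (N * W_{t-1} + U_{t-1}) / (N + 1) to a sequence
    of per-quantum utilizations, starting from W = 0."""
    weighted, trace = 0.0, []
    for u in utilizations:
        weighted = (n * weighted + u) / (n + 1)
        trace.append(weighted)
    return trace


# Busy for 9 quanta, idle for 1, repeated until the filter settles.
workload = ([1.0] * 9 + [0.0]) * 30
trace = avg_n_trace(workload, 3)

steady = trace[-10:]  # one full period after the transient has died out
print(round(min(steady), 3), round(max(steady), 3))  # roughly 0.735 0.98
```

Even though the true utilization of this workload is a constant 90% per period, the weighted value swings between roughly 74% and 98% within every period, wide enough to cross typical scale-up and scale-down thresholds.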


We also simulated interval-based averaging policies that 
used a pure average rather than an exponentially de- 
caying weighting function, but our simulations indi- 
cated that that policy would perform no better than the 
weighted averaging policy. Simple averaging suffers 
from the same problems experienced by the weighted 
averaging if you do not average the appropriate period. 


5.4 Summary of Results


We are omitting a detailed exposition on the scheduling 
behavior of each scheduling policy primarily because 
most of them resulted in equivalent (and poor) behavior. 
Recall that the best possible scheduling goal for MPEG 
would be to switch to a 132MHz speed and continue to 
render all the frames at that speed. No heuristic policy 
that we examined achieved this goal. Figure 8 shows 
the clock setting behavior of the best policy we found. 
That policy uses the PAST heuristic (i.e. AVG_0) and
“pegs” the CPU speed either to 206MHz or 59MHz
depending on the weight metric. The bounds on the
hysteresis were that a CPU utilization greater than 98%
would cause the CPU to increase the clock speed and a
CPU utilization less than 93% would decrease the clock
speed.
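
The best-performing policy is simple enough to state in a few lines. The Python sketch below is an illustration of the thresholds given above, not the kernel module itself; the function name and the kHz constants are assumptions.

```python
LOW_KHZ, HIGH_KHZ = 59000, 206400  # the two "pegged" clock settings


def past_peg(speed_khz, last_utilization, up=0.98, down=0.93):
    """PAST (AVG_0) with peg/peg speed setting: look only at the
    previous quantum and jump straight to the highest or lowest
    clock when utilization leaves the hysteresis band."""
    if last_utilization > up:
        return HIGH_KHZ
    if last_utilization < down:
        return LOW_KHZ
    return speed_khz  # inside the band: keep the current clock
```

Because only utilizations between 93% and 98% leave the clock unchanged, a workload that alternates between saturated and partially idle quanta will flip between the two settings often, which matches the frequent changes noted for this policy.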


This policy is “best” because it never misses any dead- 
line (across all the applications) and it also saves a small 
but significant amount of energy. This last point is il- 
lustrated in Table 2. This table shows the 95% confi- 
dence interval for the average energy needed to run the 
MPEG application. The reduction in energy between 
206MHz and 132MHz occurs because the application 
wastes fewer cycles in the application idle loop used to 
meet the frame delays for the MPEG clip. An energy
reduction of approximately 8% occurs when we drop the
processor voltage to 1.23V; this is less than the 15%
maximum reduction we measured because the application
uses resources (e.g. audio) that are not affected by
voltage scaling.





Figure 8: Clock frequency for the MPEG application using
the best scheduling policy from our empirical study.
The scheduling policy selects only the 59MHz or 206MHz
clock settings and changes clock settings frequently.
This scheduling policy results in suboptimal energy
savings but avoids noticeable application slowdown.


The PAST policy we described results in a small but
statistically significant reduction in energy for the MPEG
application. Allowing the processor to scale the voltage
when the clock speed drops below 162.2MHz results in
no statistically significant further decrease.


We initially surmised that there is no improvement
because the cost of voltage and clock scaling on our
platform outweighs any gains. We measured the cost of
clock and voltage scaling using the DAQ. To measure 
clock scaling, we coded a tight loop that switched the 
processor clock as quickly as possible. 


Before each clock change, we inverted the state of a spe- 
cific GPIO and used the DAQ to measure the interval 
with high precision. We took measurements when the 
clock changed across many different clock settings (e.g.
from 59 to 206MHz, from 191 to 206MHz and so on). 


Clock scaling took approximately 200 microseconds,
independent of the starting or target speed. During that
time, the processor cannot execute instructions. Thus,
the cost of a frequency change varies between 11,200
clock periods at 59MHz and 40,000 clock periods at 200MHz.


We measured the time for the voltage to settle following
a voltage change. It takes approximately 250 microseconds to
reduce voltage from 1.5V to 1.23V; in fact, the volt- 
age slowly reduces, drops below 1.23V and then rapidly 
settles on 1.23V. Voltage increases were effectively in- 
stantaneous. We suspect the slow decay occurs because 
of capacitance; many processors use external decoupling 
capacitors to provide sufficient current sourcing for pro- 
cessors that have widely varying current demands. 


These measurements indicate that the time needed for 
clock and voltage changes is less than 2% of the
scheduling interval; thus, we would be able to change 
the clock or voltage on every scheduling decision with 
less than 2% overhead. The fact that we see little energy 
reduction is related to the limited energy savings possi- 
ble with the voltage scaling available on this platform 
and the efficacy of the policies we explored. 


6 Conclusions and Future Work 


Our implementation results were disappointing to us;
we had hoped to be able to identify a prediction
heuristic that resulted in significant energy savings, and we
thought that the claims made by previous studies would
be borne out by experimentation. Although we have
found a policy that saves some energy, that policy leaves
much to be desired. The policy causes many voltage and
clock changes, which may incur unnecessary overhead;
this will be less of a problem as processors are better
designed to accommodate those changes. However, the
policy did result in both the most responsive system
behavior and most significant energy reduction of all the
policies we examined.


Algorithm                                        Energy (95% confidence interval)
Constant Speed @ 206.4 MHz, 1.5 Volts            85.59 - 86.49
Constant Speed @ 132.7 MHz, 1.5 Volts            79.59 - 80.94
Constant Speed @ 132.7 MHz, 1.23 Volts           73.76 - 74.41
PAST, Peg - Peg, Thresholds: > 98% scales up,    85.03 - 85.47
  < 93% scales down, 1.5 Volts
PAST, Peg - Peg, Thresholds: > 98% scales up,    84.60 - 85.45
  < 93% scales down, Voltage Scaling @ 162.2 MHz

Table 2: Summary of Performance of Best Clock Scaling Algorithms


Figure 9: Non-linear change in Utilization with Clock
Frequency (in MHz)


Processor Freq.   Cycles/Mem. Reference   Cycles/Cache Reference
 59.0             11                      39
 73.7             11                      39
 88.5             11                      39
103.2             11                      39
118.0             13                      41
132.7             14                      42
147.5             14                      49
162.2             15                      50
176.9             18                      60
191.7             19                      61
206.4             20                      69

Table 3: Memory access time in cycles for reading
individual words as well as full cache lines.


As with all empirical studies, there are anomalies in our 
system that we cannot explain and that may have
influenced our results. We found that the processor utiliza-
tion does not always vary linearly with clock frequency. 
Figure 9 shows the processor utilization vs. clock fre- 
quency for the MPEG benchmark. There is a distinct 
“plateau” between 162MHz and 176.9MHz. We believe 
that this delay may be induced by the varying number 
of clock cycles needed for memory accesses as the pro- 
cessor frequency changes, as shown in Table 3. That 
table shows the memory access time for EDO DRAM
for reading individual words or a full cache line; there 
is an obvious non-linear increase between 162MHz and 
176.9MHz. The potential speed mismatch between pro- 
cessor and memory has been noted by others [12], but 
we have not devised a way to verify that this is the only 
factor causing the non-linear behavior we noted. 
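
One way to see the effect is to convert the cycle counts in Table 3 into wall-clock time. The Python sketch below does only that arithmetic; the cycle counts are copied from Table 3, while the function name and the output format are illustrative.

```python
# (processor MHz, cycles per word read, cycles per cache-line read)
TABLE_3 = [
    (59.0, 11, 39), (73.7, 11, 39), (88.5, 11, 39), (103.2, 11, 39),
    (118.0, 13, 41), (132.7, 14, 42), (147.5, 14, 49), (162.2, 15, 50),
    (176.9, 18, 60), (191.7, 19, 61), (206.4, 20, 69),
]


def latency_ns(freq_mhz, cycles):
    """Wall-clock time of `cycles` processor cycles at `freq_mhz`."""
    return cycles * 1000.0 / freq_mhz


for mhz, word, line in TABLE_3:
    print(f"{mhz:6.1f} MHz  word {latency_ns(mhz, word):6.1f} ns  "
          f"line {latency_ns(mhz, line):6.1f} ns")
```

Under this conversion a word read takes about 92 ns at 162.2MHz but about 102 ns at 176.9MHz, so memory accesses become slower in absolute terms across exactly the step where the utilization plateau appears.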


This paper is the first step in an effort to provide
robust support for voltage and clock scheduling within the
Linux operating system. Although our initial results are
disappointing, we feel that they serve to stop us from
attempting to devise clever heuristics that could be used
for clock scheduling. It may well be that Pering et al. [11]
reached a similar conclusion since their later
publications discontinued the use of heuristics, but their
publications don’t describe the implementation of their
operating system design or the rationale behind the policies
used. Furthermore, they don’t describe how deadlines
are to be “synthesized” for applications such as Web
and TalkingEditor where there is no clear “deadline”.


Our immediate future work is to provide “deadline”
mechanisms in Linux. These deadlines are not precisely
the same mechanism needed in a true real-time O/S:
in an RTOS, the application does not care if the deadline
is reached early, while energy scheduling would prefer
for the deadline to be met as late as possible. A further
challenge we face will be to find a way to automatically
synthesize those deadlines for complex applications.


Abstract


Freeblock scheduling is a new approach to utilizing 
more of a disk’s potential media bandwidth. By 
filling rotational latency periods with useful media 
transfers, 20-50% of a never-idle disk’s bandwidth 
can often be provided to background applications 
with no effect on foreground response times. This 
paper describes freeblock scheduling and demon- 
strates its value with simulation studies of two con- 
crete applications: segment cleaning and data min- 
ing. Free segment cleaning often allows an LFS file 
system to maintain its ideal write performance when 
cleaning overheads would otherwise reduce perfor- 
mance by up to a factor of three. Free data mining 
can achieve over 47 full disk scans per day on an 
active transaction processing system, with no effect 
on its disk performance. 


1 Introduction 


Disk drives increasingly limit performance in many 
computer systems, creating complexity and restrict- 
ing functionality. However, in recent years, the 
rate of improvement in media bandwidth (48+% per 
year) has stayed close to that of computer system at- 
tributes that are driven by Moore’s Law. It is only 
the mechanical positioning aspects (i.e., seek times 
and rotation speeds) that fail to keep pace. If 100% 
utilization of the potential media bandwidth could 
be realized, disk performance would scale roughly 
in proportion to the rest of the system over time. 
Unfortunately, utilizations of 2-15% are more com- 
monly observed in practice. 


This paper describes and analyzes a new approach, 
called freeblock scheduling, to increasing media band- 
width utilization. By interleaving low priority disk 
activity with the normal workload (here referred to 
as background and foreground, respectively), one 
can replace many foreground rotational latency de-
lays with useful background media transfers. With
appropriate freeblock scheduling, background tasks 
can receive 20-50% of a disk’s potential media band- 
width without any increase in foreground request 
service times. Thus, this background disk activity is 
completed for free during the mechanical positioning 
for foreground requests. 


There are many disk-intensive background tasks 
that are designed to occur during otherwise idle 
time. Examples include disk reorganization, file sys- 
tem cleaning, backup, prefetching, write-back, in- 
tegrity checking, RAID scrubbing, virus detection, 
tamper detection, report generation, and index re- 
organization. When idle time does not present itself, 
these tasks either compete with foreground tasks or 
are simply not completed. Further, when they do 
compete with other tasks, these background tasks do 
not take full advantage of their relatively loose time 
constraints and paucity of sequencing requirements. 
As a result, these “idle time” tasks often cause per- 
formance or functionality problems in busy systems. 
With freeblock scheduling, background tasks can op- 
erate continuously and efficiently, even when they do 
not have the system to themselves. 


This paper quantifies the effects of disk, workload, 
and disk scheduling algorithms on potential free 
bandwidth. Algorithms are developed for increas- 
ing the available free bandwidth and for efficient 
freeblock scheduling. For example, with less than 
a 6% increase in average foreground access time, 
a Shortest-Positioning-Time-First scheduling algo- 
rithm that favors reduction of seek time over reduc- 
tion of rotational latency can provide an additional 
66% of free bandwidth. Experiments also show 
that freeblock scheduling decisions can be made ef- 
ficiently enough to be effective in highly loaded sys- 
tems. 


This paper uses simulation to explore freeblock 
scheduling, demonstrating its value with concrete 
examples of its use for storage system management
and disk-intensive applications. The first example 
shows that cleaning in a log-structured file system 
can be done for free even when there is no truly idle 
time, resulting in up to a 300% speedup. The second 
example explores the use of free bandwidth for data 
mining on an active on-line transaction processing 
(OLTP) system, showing that over 47 full scans per 
day of a 9GB disk can be made with no impact on 
OLTP performance. 


In a recent paper [45], we proposed a scheme for 
performing data mining “for free” on a busy OLTP 
system. The scheme combines Active Disks [46] with 
use of idle time and aggressive interleaving of data 
mining requests with OLTP requests. This paper 
generalizes and extends the latter, developing an un- 
derstanding of free bandwidth availability and ex- 
ploring its use. 


The remainder of this paper is organized as follows. 
Section 2 describes free bandwidth and discusses its 
use in systems. Section 3 quantifies the availabil- 
ity of potential free bandwidth and how it varies 
with disk characteristics, foreground workloads, and 
foreground disk scheduling algorithms. Section 4 de- 
scribes our freeblock scheduling algorithm. Section 5 
evaluates the use of free bandwidth for cleaning of 
LFS log segments. Section 6 evaluates the use of 
free bandwidth for data mining of active OLTP sys- 
tems. Section 7 discusses related work. Section 8 
summarizes the paper’s contributions. 


2 Free Bandwidth 


At a high level, the time required for a disk media
access, $T_{access}$, can be computed as


$$T_{access} = T_{seek} + T_{rotate} + T_{transfer}$$


Of $T_{access}$, only the $T_{transfer}$ component repre-
sents useful utilization of the disk head. Unfortu-
nately, the other two components generally domi-
nate. Many data placement and scheduling algo-
rithms have been devised to increase disk head uti-
lization by increasing transfer sizes and reducing po-
sitioning overheads. Freeblock scheduling comple-
ments these techniques by transferring additional
data during the $T_{rotate}$ component of $T_{access}$.
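

As a rough illustration of how the three components compare, the sketch below evaluates the expression above for one small request. All numbers are assumptions chosen for illustration (a 5.0 ms seek, a 10,000 RPM spindle, an average rotational latency of half a revolution, and an 8-sector request on a 300-sector track); they are not measurements from this paper.

```python
def access_breakdown(seek_ms, rpm, sectors_per_track, request_sectors):
    """Return (T_seek, T_rotate, T_transfer) in milliseconds for one request.

    Assumes the average rotational latency is half a revolution and that
    the request fits on a single track.
    """
    revolution_ms = 60_000.0 / rpm
    rotate_ms = revolution_ms / 2.0
    transfer_ms = revolution_ms * request_sectors / sectors_per_track
    return seek_ms, rotate_ms, transfer_ms


def head_utilization(seek_ms, rotate_ms, transfer_ms):
    """Fraction of T_access spent on useful media transfer."""
    return transfer_ms / (seek_ms + rotate_ms + transfer_ms)


# Illustrative (assumed) parameters, not taken from the paper's tables.
seek, rotate, transfer = access_breakdown(5.0, 10_000, 300, 8)
print(f"seek={seek:.2f} ms  rotate={rotate:.2f} ms  transfer={transfer:.2f} ms")
print(f"useful head utilization: {head_utilization(seek, rotate, transfer):.1%}")
```

Under these assumed values the transfer term is a small fraction of the total, which is the situation freeblock scheduling targets by putting the rotate term to use.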


Fundamentally, the only time the disk head can- 
not be transferring data sectors to or from the me- 
dia is during a seek. In fact, in most modern disk 
drives, the firmware will transfer a large request’s 
data to or from the media “out of order” to mini- 
mize wasted time; this feature is sometimes referred 
to as zero-latency or immediate access. While seeks
are unavoidable costs associated with accessing de-
sired data locations, rotational latency is an artifact
of not doing something more useful with the disk 
head. Since disk platters rotate constantly, a given 
sector will rotate past the disk head at a given time, 
independent of what the disk head is doing up until 
that time. So, there is an opportunity to do some- 
thing more useful than just waiting for desired sec- 
tors to arrive at the disk head. 


Freeblock scheduling consists of predicting how 
much rotational latency will occur before the next 
foreground media transfer, squeezing some addi- 
tional media transfers into that time, and still get- 
ting to the destination track in time for the fore- 
ground transfer. The additional media transfers may 
be on the current or destination tracks, on another 
track near the two, or anywhere between them, as 
illustrated in Figure 1. In the two latter cases, ad- 
ditional seek overheads are incurred, reducing the 
actual time available for the additional media trans- 
fers, but not completely eliminating it. 
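

The effect of the extra seek on the usable gap can be sketched with simple arithmetic. The helper below is a simplified model, not the paper's predictor: it ignores rotational alignment and layout, uses integer microseconds, and assumes a block time of 20 microseconds and an extra seek that costs exactly one block time. Those assumed values were picked so that the counts match the 4-block and 3-block cases described for Figure 1.

```python
def free_blocks_in_gap(latency_us, block_time_us, extra_seek_us=0):
    """Number of whole blocks that fit in a rotational latency gap.

    All times are integer microseconds. extra_seek_us is the additional
    seek cost of detouring to another track (0 when the free transfer is
    on the current or destination track).
    """
    usable_us = latency_us - extra_seek_us
    if usable_us <= 0:
        return 0
    return usable_us // block_time_us


BLOCK_US = 20            # assumed time for one block to pass under the head
GAP_US = 4 * BLOCK_US    # a gap worth 4 blocks of rotation

same_track = free_blocks_in_gap(GAP_US, BLOCK_US)
other_track = free_blocks_in_gap(GAP_US, BLOCK_US, extra_seek_us=20)
print(same_track, other_track, other_track / same_track)
```

The detour reduces the number of free blocks but does not eliminate the opportunity, as long as the extra seek is shorter than the gap.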


Accurately predicting future rotational latencies re- 
quires detailed knowledge of many disk performance 
attributes, including layout algorithms and time- 
dependent mechanical positioning overheads. These 
predictions can utilize the same basic algorithms and 
information that most modern disks employ for their 
internal scheduling decisions, which are based on 
overall positioning overheads (seek time plus rota- 
tional latency) [50, 30]. However, this may require 
that freeblock scheduling decisions be made by disk 
firmware. Fortunately, the increasing processing ca- 
pabilities of disk drives [1, 22, 32, 46] make advanced 
on-drive storage management feasible [22, 57].


2.1 Using Free Bandwidth 


Potential free bandwidth exists in the time gaps that 
would otherwise be rotational latency delays for fore- 
ground requests. Therefore, freeblock scheduling 
must opportunistically match these potential free 
bandwidth sources to real bandwidth needs that can 
be met within the given time gaps. The tasks that 
will utilize the largest fraction of potential free band- 
width are those that provide the freeblock scheduler 
with the most flexibility. Tasks that best fit the free- 
block scheduling model have low priority, large sets 
of desired blocks, no particular order of access, and 
small working memory footprints. 


[Figure 1 diagram. (a) Original sequence of foreground requests: seek to B's track, then rotational latency. (b) One freeblock scheduling alternative: freeblock read, then seek to B's track. (c) Another freeblock scheduling alternative: seek to another track, freeblock read, then seek to B's track.]


Figure 1: Illustration of two freeblock scheduling possibilities. Three sequences of steps are shown, each starting
after completing the foreground request to block A and finishing after completing the foreground request to block B. Each step
shows the position of the disk platter, the read/write head (shown by the pointer), and the two foreground requests (in black)
after a partial rotation. The top row, labelled (a), shows the default sequence of disk head actions for servicing request B,
which includes 4 sectors worth of potential free bandwidth (a.k.a. rotational latency). The second row, labelled (b), shows free
reading of 4 blocks on A’s track using 100% of the potential free bandwidth. The third row, labelled (c), shows free reading of
3 blocks on another track, yielding 75% of the potential free bandwidth.


Low priority. Free bandwidth is inherently in the
background, and freeblock requests will only be ser-
viced when opportunities arise. Therefore, response
times may be extremely long for such requests, making
them most appropriate for background activities.
Further, freeblock scheduling is not appropriate for a
set of equally important requests; splitting such a set
between a foreground queue and a freeblock queue
reduces the options of both schedulers. All such re-
quests should be considered by a single scheduler.


Large sets of desired blocks. Since freeblock 
schedulers work with restricted free bandwidth op- 
portunities, their effectiveness tends to increase 
when they have more options. That is, the larger the 
set of disk locations that are desired, the higher the 
probability that a free bandwidth opportunity can 
be matched to a need. Therefore, tasks that involve 
larger fractions of the disk’s capacity generally uti- 
lize larger fractions of the potential free bandwidth. 


No particular order of access. Ordering require- 
ments restrict the set of requests that can be con- 
sidered by the scheduler at any point in time. Since 
the effectiveness of freeblock scheduling is directly 
related to the number of outstanding requests, work-
loads with little or no ordering requirements tend to
utilize more of the potential free bandwidth.


Small working memory footprints.  Signifi- 
cant need to buffer multiple blocks before process- 
ing them creates artificial ordering requirements due 
to memory limitations. Workloads that can im- 
mediately process and discard data from freeblock 
requests tend to be able to request more of their 
needed data at once. 


To clarify the types of tasks that fit the freeblock 
scheduling model, Table 1 presents a sample inter- 
face for a freeblock scheduling subsystem, ignoring 
component and protection boundary issues. This in- 
terface is meant to be illustrative only; a comprehen- 
sive API would need to address memory allocation, 
protection, and other issues. 


Function Name           Arguments                               Description
freeblock_readblocks    diskaddrs, blksize, callback            Register freeblock read request(s)
freeblock_writeblocks   diskaddrs, blksize, buffers, callback   Register freeblock write request(s)
freeblock_abort         diskaddrs, blksize                      Abort registered freeblock request(s)
freeblock_promote       diskaddrs, blksize                      Promote registered freeblock request(s)
*(callback)             diskaddr, blksize, buffer               Call back to task with desired block


Table 1: A simple interface to a freeblock subsystem. freeblock_readblocks and freeblock_writeblocks register one or
more single-block freeblock requests, with an application-defined block size. freeblock_abort and freeblock_promote are applied
to previously registered requests, to either cancel pending freeblock requests or convert them to foreground requests. When
promoted, multiple contiguous freeblock requests can be merged into a single foreground request. *(callback) is called by the
freeblock subsystem to report availability (or write completion) of a single previously-requested block. When the request was
a read, buffer points to a buffer containing the desired data. The freeblock subsystem reclaims this buffer when *(callback)
returns, meaning that the callee must either process the data immediately or copy it to another location before returning
control.


This sample freeblock API has four important char-
acteristics. First, no call into the freeblock schedul-
ing subsystem waits for a disk access. Instead, calls
to register requests return immediately, and subse-
quent callbacks report request completions. This al-
lows applications to register large sets of freeblock
requests. Second, block sizes are provided with each
freeblock request, allowing applications to ensure 
that useful units are provided to them. Third, free- 
block read requests do not specify memory locations 
for read data. Completion callbacks provide pointers 
to buffers owned by the freeblock scheduling subsys- 
tem and indicate which requested data blocks are 
in them. This allows tasks to register many more 
freeblock reads than their memory resources would 
otherwise allow, giving greater flexibility to the free- 
block scheduling subsystem. For example, the data 
mining example in Section 6 starts by registering 
freeblock reads for all blocks on the disk. Fourth, 
freeblock requests can be aborted or promoted to 
foreground requests at any time. The former al- 
lows tasks to register for more data than are ab- 
solutely required (e.g., a search that only needs one 
match). The latter allows tasks to increase the prior- 
ity of freeblock requests that may soon impact fore- 
ground task performance (e.g., a space compression 
task that has not made sufficient progress). 
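

To make the calling pattern concrete, the following is a toy, in-memory stand-in for the read side of the interface in Table 1. It is only a sketch under stated assumptions: the class name `FreeblockSubsystem`, the `head_passes_over` method that simulates a free bandwidth opportunity, and the `pending_count` helper are all hypothetical and exist only for this illustration. No real disk access or scheduling is modeled.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Callback signature follows Table 1: (diskaddr, blksize, buffer).
Callback = Callable[[int, int, bytes], None]


class FreeblockSubsystem:
    """Toy stand-in for the Table 1 interface (reads only, hypothetical)."""

    def __init__(self, disk: Dict[int, bytes]):
        self._disk = disk
        self._pending: Dict[Tuple[int, int], Callback] = {}
        self.promoted: List[Tuple[int, int]] = []

    def readblocks(self, diskaddrs: Iterable[int], blksize: int,
                   callback: Callback) -> None:
        # Registration returns immediately; no disk access happens here.
        for addr in diskaddrs:
            self._pending[(addr, blksize)] = callback

    def abort(self, diskaddrs: Iterable[int], blksize: int) -> None:
        for addr in diskaddrs:
            self._pending.pop((addr, blksize), None)

    def promote(self, diskaddrs: Iterable[int], blksize: int) -> None:
        # A real subsystem would hand these to the foreground scheduler.
        for addr in diskaddrs:
            if self._pending.pop((addr, blksize), None) is not None:
                self.promoted.append((addr, blksize))

    def pending_count(self) -> int:
        return len(self._pending)

    def head_passes_over(self, addr: int, blksize: int) -> None:
        """Simulate a free bandwidth opportunity that covers one block."""
        callback = self._pending.pop((addr, blksize), None)
        if callback is not None:
            # The buffer belongs to the subsystem and is only valid
            # for the duration of the callback.
            callback(addr, blksize, self._disk[addr])


# A search that needs only one match registers every block, then aborts.
disk = {addr: (b"needle" if addr == 2 else b"hay") for addr in range(4)}
fb = FreeblockSubsystem(disk)
found: List[int] = []


def on_block(addr: int, blksize: int, buf: bytes) -> None:
    if buf == b"needle":
        found.append(addr)
        fb.abort(range(4), blksize)  # remaining requests are no longer needed


fb.readblocks(range(4), 4096, on_block)
for addr in (3, 2, 0, 1):            # blocks arrive in whatever order is free
    fb.head_passes_over(addr, 4096)
print(found, fb.pending_count())
```

The example mirrors the "search that only needs one match" case: the task registers more data than it strictly requires and cancels the rest once the callback has seen what it needs.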


2.2 Applications 


Freeblock scheduling is a new tool, and we expect 
that system designers will find many unanticipated 
uses for it. This section describes some of the appli- 
cations we see for its use. 


Scanning applications. In many systems, there 
are a variety of support tasks that scan large por- 
tions of disk contents. Such activities are of direct 
benefit to users, although they may not be the high- 
est priority of the system. Examples of such tasks 
include report generation, RAID scrubbing, virus 
detection, tamper detection [33], and backup. Sec- 
tion 6 explores data mining of an active transaction 
processing system as a concrete example of such use 
of free bandwidth. 
These disk-scanning application tasks are ideal
candidates for free bandwidth utilization. Appro- 
priately structured, they can exhibit all four of 
the desirable characteristics discussed above. For 
example, report generation tasks (and data mining 
in general) often consist of collecting statistics about 
large sets of small, independent records. These 
tasks may be of lower priority than foreground 
transactions, access a large set of blocks, involve 
no ordering requirements, and process records 
immediately. Similarly, virus detectors examine 
large sets of files for known patterns. The files can 
be examined in any order, though internal statistics 
for partially-checked files may have significant 
memory requirements when pieces of files are read 
in no particular order. Backup applications can 
be based on physical format, allowing flexible 
block ordering with appropriate indices, though 
single-file restoration is often less efficient [28, 14]. 
Least flexible of these examples would be tamper 
detection that compares current versions of data 
to “safe” versions. While the comparisons can 
be performed in any order, both versions of a 
particular datum must be available in memory 
to complete a comparison. Memory limitations 
are unlikely to allow arbitrary flexibility in this case. 


Internal storage optimization. Another promis- 
ing use for free bandwidth is internal storage sys- 
tem optimization. Many techniques have been de- 
veloped for reorganizing stored data to improve per- 
formance of future accesses. Examples include plac- 
ing related data contiguously for sequential disk ac- 
cess [37, 57], placing hot data near the center of 
the disk [56, 48, 3], and replicating data on disk
to provide quicker-to-access options for subsequent
reads [42, 61]. Other examples include index reorga-
nization [29, 23] and compression of cold data [11].
Section 5 explores segment cleaning in log-structured
file systems as a concrete example of such use of free 
bandwidth. 


Although internal storage optimization activities ex- 
hibit the first two qualities listed in Section 2.1, they 
can impose some ordering and memory restrictions 
on media accesses. For example, reorganization gen- 
erally requires clearing (i.e., reading or moving) des- 
tination regions before different data can be written. 
Also, after opportunistically reading data for reorga- 
nization, the task must write this data to their new 
locations. Eventually, progress will be limited by the 
rate at which these writes can be completed, since 
available memory resources for buffering such data 
are finite. 


Prefetching and Prewriting. Another use of free 
bandwidth is for anticipatory disk activities such 
as prefetching and prewriting. Prefetching is well- 
understood to offer significant performance enhance- 
ments [44, 9, 25, 36, 54]. Free bandwidth prefetch- 
ing should increase performance further by avoiding 
interference with foreground requests and by min- 
imizing the opportunity cost of aggressive predic- 
tions. As one example, the sequence shown in Fig- 
ure 1(b) shows one way that the prefetching com- 
mon in disk firmware could be extended with free 
bandwidth. Still, the amount of prefetched data is 
necessarily limited by the amount of memory avail- 
able for caching, restricting the number of freeblock 
requests that can be issued. 


Prewriting is the same concept in reverse. That is, 
prewriting is early writing out of dirty blocks un- 
der the assumption that they will not be overwrit- 
ten or deleted before write-back is actually neces- 
sary. As with prefetching, the value of prewriting 
and its relationship with non-volatile memory are 
well-known [4, 10, 6, 23]. Free bandwidth prewrit- 
ing has the same basic benefits and limitations as 
free prefetching. 


3 Availability of Free Bandwidth 


This section quantifies the availability of potential 
free bandwidth, which is equal to a disk’s total po- 
tential bandwidth multiplied by the fraction of time 
it spends on rotational latency delays. The amount 
of rotational latency depends on a number of disk, 
workload, and scheduling algorithm characteristics. 
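

The definition above is a single multiplication. The sketch below applies it to an assumed 20 MB/s media bandwidth and a rotational latency fraction of one third; the bandwidth figure is a placeholder for illustration and is not taken from this paper.

```python
def potential_free_bandwidth(media_bandwidth_mb_s, rotational_latency_fraction):
    """Potential free bandwidth = total potential bandwidth x latency fraction."""
    if not 0.0 <= rotational_latency_fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return media_bandwidth_mb_s * rotational_latency_fraction


# Assumed media bandwidth of 20 MB/s, with 33% of head time in rotational latency.
free_mb_s = potential_free_bandwidth(20.0, 0.33)
print(f"{free_mb_s:.1f} MB/s of potential free bandwidth")
```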


Figure 2: Disk head usage for several modern disks.
The cross-hatch portion, representing rotational latency, in-
dicates the percentage of total disk bandwidth available as
potential free bandwidth.


The experimental data in this section was gener-
ated with the DiskSim simulator [20], which has
been shown to accurately model several modern disk
drives [17], including those explored here. By de- 
fault, the experiments use a Quantum Atlas 10K 
disk drive and a synthetic workload referred to as 
random. This random workload consists of 10,000 
foreground requests issued one at a time with no idle 
time between requests (closed system arrival model 
with no think time). Other default parameters for 
the random workload are request size of 4KB, uni- 
form distribution of starting locations across the disk 
capacity, and 2:1 ratio of reads to writes. 
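

The default random workload can be approximated with a few lines of code. The generator below is a sketch, not the DiskSim configuration used for the experiments: the function name, the 512-byte alignment of starting locations, and the fixed seed are assumptions made for this example.

```python
import random


def random_workload(capacity_bytes, n_requests=10_000, request_bytes=4096,
                    read_fraction=2 / 3, seed=0):
    """Yield (op, start_byte, size) tuples for a synthetic random workload.

    Starting locations are uniform across the disk capacity and the
    read:write ratio is 2:1 by default. Alignment to 512-byte sectors is
    an assumption of this sketch.
    """
    rng = random.Random(seed)
    sector = 512
    max_start_sector = (capacity_bytes - request_bytes) // sector
    for _ in range(n_requests):
        start = rng.randint(0, max_start_sector) * sector
        op = "read" if rng.random() < read_fraction else "write"
        yield op, start, request_bytes


requests = list(random_workload(9 * 10**9))
reads = sum(1 for op, _, _ in requests if op == "read")
print(len(requests), f"{reads / len(requests):.2f} reads")
```

In a closed arrival model with no think time, each request from this list would be issued as soon as the previous one completes.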


Most of the bar graphs presented in this section have 
a common structure. Each bar breaks down disk 
head usage into several regions that add up to 100%, 
with each region representing the percentage of the 
total attributed to the corresponding activity. All 
such bars include regions for foreground seek times, 
rotational latencies, and media transfers. The ro- 
tational latency region represents the potential free 
bandwidth (as a percentage of the disk’s total band- 
width) available for the disk-workload combination. 


3.1 Impact of disk characteristics 


                      Quantum      Seagate      Seagate      Seagate       IBM
                      Atlas 10K    Cheetah 4LP  Cheetah 9LP  Cheetah 18LP  Ultrastar 9ES
Year                  1999         1996         1997         1998          1998
Capacity              9 GB         4.5 GB       9 GB         9 GB          9 GB
Cylinders             10042        6581         6962         9772          11474
Tracks per cylinder   6            8            12           6             5
Sectors per track     229-334      131-195      167-254      252-360       247-390
Spindle speed (RPM)   10025        10033        10025        10025         7200
Average seek          5.0 ms       7.7 ms       5.4 ms       5.2 ms        7.0 ms
Min-Max seeks         1.2-10.8 ms  0.6-16.1 ms  0.8-10.6 ms  0.7-10.8 ms   1.1-12.7 ms


Table 2: Basic characteristics of several modern disk drives.


Figure 3: Disk head usage as a function of request
size.


Figure 2 shows breakdowns of disk head usage for
five modern disk drives whose basic characteristics
are given in Table 2. Overall, for the random work-
load, about one third (27-36%) of each disk’s head
usage can be attributed to rotational latency. Thus,
about one third of the media bandwidth is avail-
able for freeblock scheduling, even with no inter-
request locality. At a more detailed level, the effect
of key disk characteristics can be seen in the break- 
downs. For example, the faster seeks of the Cheetah 
9LP, relative to the Cheetah 4LP, can be seen in the 
smaller seek component. 


3.2 Impact of workload characteristics 


Figure 3 shows how the breakdown of disk head us-
age changes as the request size of the random work-
load increases. As expected, larger request sizes
yield larger media transfer components, reducing the
seek and latency components by amortizing larger
transfers over each positioning step. Still, even for
large random requests (e.g., 256KB), disk head uti-
lization is less than 55% and potential free band-
width is 15%.


Figure 4: Disk head usage as a function of access
locality. The default workload was modified such that
a percentage of request starting locations are “local” (taken
from a normal distribution centered on the last requested lo-
cation, with a standard deviation of 4MB). The remaining
requests are uniformly distributed across the disk’s capacity.
This locality model crudely approximates the effect of “cylin-
der group” layouts [38] on file system workloads.
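

The locality model described in the Figure 4 caption can be sketched as a small generator. Several details here are assumptions of the sketch rather than statements from the paper: 4MB is taken as 4 x 2^20 bytes, the first "last requested location" is the middle of the disk, and local draws that fall outside the disk are clamped to its edges.

```python
import random


def local_workload(capacity_bytes, local_fraction, n_requests,
                   stddev_bytes=4 * 2**20, seed=0):
    """Yield request starting locations with a tunable degree of locality.

    With probability local_fraction the next start is drawn from a normal
    distribution centered on the previous start; otherwise it is uniform
    over the disk capacity.
    """
    rng = random.Random(seed)
    last = capacity_bytes // 2          # assumed initial head location
    for _ in range(n_requests):
        if rng.random() < local_fraction:
            start = int(rng.gauss(last, stddev_bytes))
            start = min(max(start, 0), capacity_bytes - 1)   # assumed clamping
        else:
            start = rng.randrange(capacity_bytes)
        last = start
        yield start


starts = list(local_workload(9 * 10**9, local_fraction=0.7, n_requests=1000))
print(len(starts), min(starts), max(starts))
```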


Figure 4 shows how the breakdown of disk head 
usage changes as the degree of access locality in- 
creases. Because access locality tends to reduce seek 
distances without directly affecting the other compo- 
nents, this graph shows that both the transfer and 
latency components increase. For example, when 
70% of the requests are within the same “cylinder 
group” [38] as the last request, almost 60% of the 
disk head’s time is spent in rotational latency and
is thus available as free bandwidth. Since disk ac-
cess locality is a common attribute of many environ-
ments, one can generally expect more potential free
bandwidth than the 33% predicted for the random 
workload. 


Figure 4 does not show the downside (for freeblock 
scheduling) of high degrees of locality — starvation 
of distant freeblock requests. That is, if foreground 
requests keep the disk head in one part of the disk, it 
becomes difficult for a freeblock scheduler to success- 
fully make progress on freeblock requests in distant 
parts of the disk. This effect is taken into account 
in the experiments of Sections 5 and 6. 


3.3 Impact of scheduling algorithm


Figure 5 shows how the breakdown of disk head 
usage changes for different scheduling algorithms 
applied to foreground requests. Specifically, four 
scheduling algorithms are shown:  First-Come- 
First-Served (FCFS), Circular-LOOK (C-LOOK), 
Shortest-Seek-Time-First (SSTF), and Shortest- 
Positioning-Time-First (SPTF). FCFS serves re- 
quests in arrival order. C-LOOK [49] selects the 
next request in ascending starting address order; if 
none exists, it selects the request with the lowest 
starting address. SSTF [16] selects the request that 
will incur the shortest seek. SPTF [30, 50, 60] se- 
lects the request that will incur the smallest overall 
positioning delay (seek time plus rotational latency). 
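

The four selection rules can be written as short functions over a list of pending requests. This is a sketch that assumes the per-request seek time and rotational latency have already been predicted for the current head position; the `Request` fields and the sample values are hypothetical and chosen so that each algorithm picks a different request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    arrival: int       # arrival order (lower is earlier)
    address: int       # starting address
    seek_ms: float     # predicted seek time from the current head position
    rotate_ms: float   # predicted rotational latency after that seek


def fcfs(pending, head_address=None):
    """First-Come-First-Served: the earliest arrival."""
    return min(pending, key=lambda r: r.arrival)


def c_look(pending, head_address):
    """Circular-LOOK: next request in ascending address order, else wrap."""
    ahead = [r for r in pending if r.address >= head_address]
    return min(ahead or pending, key=lambda r: r.address)


def sstf(pending, head_address=None):
    """Shortest-Seek-Time-First."""
    return min(pending, key=lambda r: r.seek_ms)


def sptf(pending, head_address=None):
    """Shortest-Positioning-Time-First: seek time plus rotational latency."""
    return min(pending, key=lambda r: r.seek_ms + r.rotate_ms)


pending = [
    Request(arrival=0, address=900, seek_ms=6.0, rotate_ms=1.0),
    Request(arrival=1, address=480, seek_ms=0.5, rotate_ms=5.5),
    Request(arrival=2, address=520, seek_ms=1.0, rotate_ms=5.0),
    Request(arrival=3, address=700, seek_ms=2.0, rotate_ms=2.0),
]
head = 500
for pick in (fcfs, c_look, sstf, sptf):
    print(pick.__name__, pick(pending, head).address)
```

In this sample, SSTF picks the request with a 0.5 ms seek even though 5.5 ms of rotational latency follows it, while SPTF picks the request with the smallest total.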


On average, C-LOOK and SSTF reduce seek times 
without affecting transfer times and rotational la- 
tencies. Therefore, we expect (and observe) the seek 
component to decrease and the other two to increase. 
In fact, for this workload, the rotational latency 
component increases to 50% of the disk head usage. 
On the other hand, SPTF tends to decrease both 
overhead components, and Figure 5 shows that the 
rotational latency component decreases significantly
(to 22%) relative to the other scheduling algorithms. 


Figure 5: Disk head usage for several foreground
scheduling algorithms. The default workload was mod-
ified to always have 20 requests outstanding. Lowering the
number of outstanding requests reduces the differences be-
tween the scheduling algorithms, as they all converge on
FCFS.


SPTF requires the same basic time predictions as
freeblock scheduling. Therefore, its superior perfor-
mance will make it a common foreground schedul-
ing algorithm in systems that can support freeblock
scheduling, making its effect on potential free band-
width a concern. To counter this effect, we propose a
modified SPTF algorithm that is weighted to select
requests with both small total positioning delays and
large rotational latency components. The algorithm,
here referred to as SPTF-SWn%, selects the request
with the smallest seek time component among the
pending requests whose positioning times are within
n% of the shortest positioning time. So, logically,
this algorithm first uses the standard SPTF algo-
rithm to identify the next most efficient request, de-
noted A, to be scheduled. Then, it makes a second
pass to find the pending request, denoted B, that
has the smallest seek time while still having a to-
tal positioning time within n% of A’s. Request B is
then selected and scheduled. The actual implemen-
tation makes a single pass, and its measured compu-
tational overhead is only 2-5% higher than that of
SPTF. This algorithm creates a continuum between
SPTF (when n = 0) and SSTF (when n = ∞), and
we expect the disk head usage breakdown to reflect
this.
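

The logical two-pass form of SPTF-SWn% can be sketched as follows. This is not the single-pass implementation measured in the paper. The function name, the tuple representation of requests, and the tie-breaking rule (smaller total positioning time wins among equal seek times) are assumptions of this sketch.

```python
def sptf_sw(pending, n_percent):
    """Return the index of the request chosen by SPTF-SWn%.

    pending is a list of (seek_ms, rotate_ms) predictions, one per request.
    """
    # Pass 1: standard SPTF finds the shortest total positioning time (request A).
    best_total = min(seek + rotate for seek, rotate in pending)
    limit = best_total * (1 + n_percent / 100.0)
    # Pass 2: among requests within n% of A, take the smallest seek (request B).
    candidates = [i for i, (seek, rotate) in enumerate(pending)
                  if seek + rotate <= limit]
    return min(candidates, key=lambda i: (pending[i][0], sum(pending[i])))


pending = [(2.0, 2.0), (1.0, 4.0), (0.5, 5.5), (6.0, 1.0)]
print(sptf_sw(pending, 0))             # behaves like SPTF
print(sptf_sw(pending, 40))            # trades some access time for latency
print(sptf_sw(pending, float("inf")))  # behaves like SSTF
```

With n = 40, the chosen request has a total positioning time of 5.0 ms instead of 4.0 ms, but 4.0 ms of that is rotational latency that a freeblock scheduler could use.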


Figure 6 shows the breakdown of disk head usage
and the average foreground request access time when
SPTF-SWn% is used for foreground request schedul-
ing. As expected, different values of n result in a
range of options between SPTF and SSTF. As n in-
creases, seek reduction becomes a priority, and the
rotational latency component of disk head usage in-
creases. At the same time, average access times in-
crease as total positioning time plays a less domi-
nant role in the decision process. Fortunately, the
benefits increase rapidly before experiencing dimin-
ishing returns, and the penalties increase slowly be-
fore ramping up. So, using SPTF-SW40% as an ex-
ample, we see that a 6% increase in average access
time can provide 66% more potential free bandwidth
(i.e., 36% rotational latency for SPTF-SW40% com-
pared to SPTF’s 22%). This represents half of the
free bandwidth difference between SPTF and SSTF
at much less than the 25% foreground access time
difference.


[Figure 6 panels: (a) Disk head usage. (b) Average access time.]

Figure 6: Disk head usage and average access time with
SPTF-SWn% for foreground scheduling. The default
workload was modified to always have 20 requests outstanding.


4 Freeblock Scheduling Decisions


Freeblock scheduling is the process of identifying
free bandwidth opportunities and matching them to
pending freeblock requests. This section describes
and evaluates the computational overhead of the
freeblock scheduling algorithm used in our experi-
ments.


Our freeblock scheduler works independently of the
foreground scheduler and maintains separate data
structures. After the foreground scheduler chooses
the next request, B, the freeblock scheduler is in-
voked. It begins by computing the rotational la-
tency that would be incurred in servicing B; this is
the free bandwidth opportunity. This computation
requires accurate estimates of disk geometry, current
head position, seek times, and rotation speed. The
freeblock scheduler then searches its list of pending
freeblock requests for the most complete use of this
opportunity; that is, our freeblock scheduler greedily
schedules freeblock requests within free bandwidth
opportunities based on the number of blocks that
can be accessed.


Our current freeblock scheduler assumes that the 
most complete use of a free bandwidth opportunity 
is the maximal answer to the question, “for each 
track on the disk, how many desired blocks could 
be accessed in this opportunity?”. For each track, 
t, answering this question requires computing the 
extra seek time involved with seeking to t and then
seeking to B’s track, as compared to seeking directly 
to B’s track. Answering this question also requires 
determining which disk blocks will pass under the 
head during the remaining rotational latency time 
and counting how many of them correspond to pend- 
ing freeblock requests. Note that no extra seek is 
required for the source track or for B’s track. 


Obviously, such an exhaustive search can be ex- 
tremely time consuming. We prune the search space 
in several ways. First, the freeblock scheduler skips 
all tracks for which the number of desired blocks is 
less than the best value found so far. Second, the 
freeblock scheduler only considers tracks for which 
the remaining free bandwidth (after extra seek over- 
heads) is greater than the best value found so far. 
Third, the freeblock scheduler starts by searching 
the source and destination cylinders (from the pre- 
vious and current foreground requests), which yield 
the best choices whenever they are fully populated, 
and then searching in ascending order of extra seek
time. Combined with the first two pruning steps, 
this ordered search frequently terminates quickly. 
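

The ordered, pruned search described above can be sketched over a simplified model. The assumptions here are not from the paper: time is measured in whole block slots, each candidate track carries a 0/1 bitmap saying whether the block passing under the head in each slot of the opportunity is desired, and the extra seek is charged entirely at the start of the window. The sketch also skips tracks that can only tie the best value found so far, since they cannot improve on it.

```python
def best_freeblock_track(tracks, window):
    """Pick the track that yields the most desired blocks in one opportunity.

    tracks maps a track name to (extra_seek_slots, desired_bitmap), where
    desired_bitmap has one 0/1 entry per slot of the opportunity window.
    Returns (track_name, block_count, tracks_examined).
    """
    best_name, best_count, examined = None, 0, 0
    # Visit tracks in ascending order of extra seek time; the source and
    # destination tracks have zero extra seek and therefore come first.
    for name, (extra, desired) in sorted(tracks.items(),
                                         key=lambda item: item[1][0]):
        remaining = window - extra
        if remaining <= best_count:
            break       # every later track has even less time left
        if sum(desired) <= best_count:
            continue    # too few desired blocks to beat the best so far
        examined += 1
        count = sum(desired[extra:window])
        if count > best_count:
            best_name, best_count = name, count
    return best_name, best_count, examined


tracks = {
    "source": (0, [1, 0, 1, 0]),
    "destination": (0, [0, 0, 0, 1]),
    "near": (1, [1, 1, 1, 1]),
    "far": (2, [1, 1, 1, 1]),
}
print(best_freeblock_track(tracks, window=4))
```

In this sample only two of the four tracks are fully examined: the destination track is skipped for having too few desired blocks, and the search stops at the far track because its remaining window cannot beat the three blocks already found.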


The algorithm described above performs well when 
there is a large number of pending freeblock re- 
quests. For example, when 20-100% of the disk is 
desired, freeblock scheduling decisions are made in 
0-2.5ms on a 550MHz Intel Pentium III, which is 
much less than average disk access times. For such 
cases, it should be possible to schedule the next free- 
block request in real-time before the current fore- 
ground request completes, even with a less-powerful 
CPU. With greater fragmentation of freeblock re- 
quests, the time required for the freeblock scheduler 
to make a decision rises significantly. The worst- 
case computation time of this algorithm occurs when 
there are large numbers of small requests evenly dis- 
tributed across all cylinders. In this case, the al- 
gorithm searches a large percentage of the available 
disk space in the hopes of finding a larger section of 
blocks than it has already found. To address this 
problem, one can simply halt searches after some 
amount of time (e.g., the time available before the 
previous foreground request completes). In most 
cases, this has a negligible effect on the achieved 
free bandwidth. For all experiments in this paper, 
the freeblock scheduling algorithm was only allowed 
to search for the next freeblock request in the time 
that the current foreground request was being ser- 
viced. 


The base algorithm described here enables signifi- 
cant use of free bandwidth, as shown in subsequent 
sections. Nonetheless, development of more efficient 
and more effective freeblock scheduling algorithms 
is an important area for further work. This will in- 
clude using both free bandwidth and idle time for 
background tasks; the algorithm above and all ex- 
periments in this paper use only free bandwidth. 


5 Free cleaning of LFS segments 


The log-structured file system [47] (LFS) was de- 
signed to reduce the cost of disk writes. Towards 
this end, it remaps all new versions of data into large, 
contiguous regions called segments. Each segment is 
written to disk with a single I/O operation, amortiz- 
ing the cost of a single seek and rotational delay over 
a write of a large number of blocks. A significant 
challenge for LFS is ensuring that empty segments 
are always available for new data. LFS answers this 
challenge with an internal defragmentation opera- 
tion called cleaning. Ideally, all necessary cleaning 
would be completed during idle time, but this is not 
always possible in a busy system. The potential and 
actual penalties associated with cleaning have been 
the subject of heated debate [52] and several research 
efforts [51, 37, 7, 39, 59]. With freeblock scheduling, 
the cost of segment cleaning can be close to zero for 
many workloads. 


5.1 Design 


Cleaning of a previously written segment involves 
identifying the subset of live blocks, reading them 
into memory, and writing them into the next seg- 
ment. Live blocks are those that have not been over- 
written or deleted by later operations; they can be 
identified by examining the on-disk segment sum- 
mary structure to determine the original identity of 
each block (e.g., block 4 of file 3) and then examining 
the auxiliary structure for the block’s original owner 
(e.g., file 3’s inode). Segment summaries, auxiliary 
structures, and live blocks can be read via freeblock 
requests. There are ordering requirements among 
these, but live blocks can be read in any order and 
moved into their new locations immediately. 
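
The liveness test described above can be sketched as follows. The data layout is an assumption made for illustration: `summary` stands in for the on-disk segment summary (one `(file_id, block_no)` entry per slot), and `inodes` stands in for the auxiliary structures, mapping each file's block numbers to their current `(segment_id, slot)` location. Neither is the format used by any particular LFS.

```python
def live_slots(segment_id, summary, inodes):
    """Return the slots of a segment that still hold live blocks."""
    live = []
    for slot, (file_id, block_no) in enumerate(summary):
        inode = inodes.get(file_id)  # None if the file has been deleted
        # A block is live only if its owner still points at this exact slot.
        if inode is not None and inode.get(block_no) == (segment_id, slot):
            live.append(slot)
    return live


# Segment 1 holds block 4 of file 3, a stale copy of block 5 of file 3,
# and a block of file 7, which has since been deleted.
summary = [(3, 4), (3, 5), (7, 0)]
inodes = {3: {4: (1, 0), 5: (2, 9)}}
print(live_slots(1, summary, inodes))  # [0]
```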


Like other background LFS cleaners, our freeblock 
segment cleaner is invoked when the number of 
empty segments drops below a certain threshold. 
When invoked, the freeblock cleaner selects several 
non-empty segments and uses freeblock requests to 
clean them in parallel with other foreground re- 
quests. Cleaning several segments in parallel pro- 
vides more requests and greater flexibility to the 
freeblock scheduler. If the freeblock cleaner is not 
effective enough, the foreground cleaner will be acti- 
vated when the minimum threshold of free segments 
is reached. 
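
The two-level trigger described above amounts to a small policy function. The sketch below is illustrative only: the function name, the returned labels, and the threshold values in the usage line are hypothetical.

```python
def cleaner_to_run(free_segments, freeblock_threshold, minimum_threshold):
    """Pick which cleaner, if any, should be active for the current free-segment count."""
    if free_segments <= minimum_threshold:
        return "foreground"  # freeblock cleaning did not keep up
    if free_segments < freeblock_threshold:
        return "freeblock"   # clean in the background using free bandwidth
    return "none"


print([cleaner_to_run(n, 8, 2) for n in (10, 5, 2)])  # ['none', 'freeblock', 'foreground']
```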


As live blocks in targeted segments are fetched, they 
are copied into the in-memory segment that is cur- 
rently being constructed by LFS writes. Because the 
live blocks are written into the same segment as data 
of foreground LFS requests, this method of cleaning 
is not entirely for free. The auxiliary data structure 
(e.g., inode) that marks the location of the block 
is updated to point to the block’s new location in 
the new segment. When all live blocks are cleaned 
from a segment on the disk, that segment becomes 
available for subsequent use. 


5.2 Experimental Setup 


To experiment with freeblock cleaning, we have 
modified a log-structured logical disk, called 
LLD [15]. LLD uses segments consisting of 128 
4KB blocks, of which 127 blocks are used for data 
and one block is used for segment summary. The 
default implementation of LLD invokes its cleaner 
only when the number of free segments drops be- 
low a threshold (set to two segments). It does not 
implement background cleaning. Thus, all segment 
cleaning activity interferes with the foreground disk 
I/O. We replaced LLD’s default segment selection al- 
gorithm for cleaning with Sprite LFS’s cost-benefit 
algorithm [47], yielding better performance for all of 
the cleaners. 
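
For reference, the cost-benefit policy of [47] ranks segments by the ratio (1 - u) * age / (1 + u), where u is the fraction of the segment that is still live and age is the age of its youngest data. A minimal sketch of that ranking follows; the function names and the tuple layout are hypothetical, and the sketch is not taken from the LLD code.

```python
def cost_benefit(utilization, age):
    """Benefit-to-cost ratio for cleaning a segment: (1 - u) * age / (1 + u)."""
    return (1.0 - utilization) * age / (1.0 + utilization)


def pick_segments(segments, count):
    """segments: list of (segment_id, utilization, age); return the best `count` ids."""
    ranked = sorted(segments, key=lambda s: cost_benefit(s[1], s[2]), reverse=True)
    return [seg_id for seg_id, _u, _age in ranked[:count]]


segments = [(0, 0.9, 100), (1, 0.5, 10), (2, 0.2, 50)]
print(pick_segments(segments, 2))  # [2, 0]
```

An old, nearly full segment (segment 0) can outrank a younger, half-empty one (segment 1), which is the behavior that distinguishes this policy from a purely greedy choice of the emptiest segment.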


Our experiments were run under Linux 2.2.14 with 
a combination of real processing times and simu- 
lated I/O times provided by DiskSim. To accom- 
plish this, we merged LLD with DiskSim. Com- 
putation times between disk I/Os are measured 
with gettimeofday, which uses the Pentium cycle 
counter. These computation times are used to ad- 
vance simulation time in DiskSim. DiskSim call- 
backs report request completions, which are for- 
warded into the LLD code as interrupts. The con- 
tents of the simulated disk are stored in a regular file, 
and the time required to access this file is excluded 
from the reported results. 


All experiments were run on a 550 MHz Intel Pen- 
tium III machine with 256MB of memory. DiskSim 
was configured to model a modified Quantum Atlas 
10K disk. Specifically, since the maximal size of an 
LLD disk is 400MB, we modified the Atlas 10K spec- 
ifications to have only one data surface, resulting in 
a capacity of 1.5GB. Thus, the LLD “partition” oc- 
cupies about 1/4 of the disk. 


To assess the effectiveness of the freeblock cleaner, 
we used the Postmark v. 1.11 benchmark, which sim- 
ulates the small-file activity predominant on busy 
Internet servers [31]. Postmark initially creates a 
pool of files, then performs a series of transactions, 
and finally deletes all files created during the bench- 
mark run. A single transaction is one access to an 
existing file (i.e., read or append) and one file manip- 
ulation (i.e., file creation or deletion). We used the 
following parameter values: 5-10KB file size (default 
Postmark value), 25000 transactions, and 100 subdi- 
rectories. The ratios of read-to-write and create-to- 
delete were kept at their default values of 1:1. The 
number of files in the initial pool was varied to pro- 
vide a range of file system capacity utilizations. 


To age the file system, we run the transaction phase 
twice and report measurements for only the second 
iteration. The rationale for running the set of trans- 
actions the first time is to spread the blocks of the file 
system among the segments in order to more closely 
resemble steady-state operation. Recall that Postmark 
first creates all files before doing transactions, 
which results in all segments being either completely 
full or completely empty, a situation very unlikely 
in normal operation. 


Figure 7: LLD performance for three cleaning strategies. 
Even with a heavy foreground workload (Postmark), 
segment cleaning can be completed with just freeblock 
requests until the file system is 93% full. 


5.3 Results 


Figure 7 shows Postmark’s performance for three 
different cleaner configurations: ORIGINAL is the 
default LLD cleaner with the Sprite LFS segment 
selection algorithm. FREEBLOCK is the freeblock 
cleaner, in which cleaning reads are freeblock re- 
quests and cleaning writes are foreground requests. 
IDEAL subtracts all cleaning costs from ORIGI- 
NAL and computes the corresponding throughput, 
which is unrealistic because infinitely fast foreground 
cleaning is not possible. 


Figure 7 shows the transactions per second for dif- 
ferent file system space utilizations, corresponding 
to different numbers of files initially created by Post- 
mark. The high throughput for low utilizations (less 
than 8% of capacity) is due to the LLD buffer cache, 
which absorbs all of the disk activity. IDEAL’s per- 
formance decreases as capacity utilization increases, 
because the larger set of files results in fewer cache 
hits for Postmark’s random file accesses. As disk 
utilization increases, ORIGINAL’s throughput de- 
creases consistently due to cleaning overheads, halv- 
ing performance at 60% capacity and quartering it 
at 85%. FREEBLOCK maintains performance close 
to IDEAL (up to 93% utilization). After 93%, there 
is insufficient time for freeblock cleaning to keep 
up with the heavy foreground workload, and the 
performance of FREEBLOCK degrades as the fore- 
ground cleaner increasingly dominates performance. 
FREEBLOCK’s slow divergence from IDEAL be- 
tween 40% and 93% occurs because FREEBLOCK is 
being charged for the write cost of cleaned segments 
while IDEAL is not. 


6 Free data mining on OLTP systems 


The use of data mining to identify patterns in large 
databases is becoming increasingly popular over a 
wide range of application domains and datasets 
[19, 12, 58]. One of the major obstacles to start- 
ing a data mining project within an organization 
is the high initial cost of purchasing the necessary 
hardware. Specifically, the most common strategy 
for data mining on a set of transaction data is to 
purchase a second database system, copy the trans- 
action records from the OLTP system to the sec- 
ond system each evening, and perform mining tasks 
only on the second system. This strategy can dou- 
ble capital and operating expenses. It also requires 
that a company gamble a sizable up-front invest- 
ment to test suspicions that there may be interesting 
“nuggets” to be mined from their OLTP databases. 
With freeblock scheduling, significant mining band- 
width can be extracted from the original system 
without affecting the original transaction processing 
activity [45]. 


6.1 Design 


Data mining involves examining large sets of records 
for statistical features and correlations. Many 
data mining operations, including nearest neighbor 
search, association rules [2], ratio and singular value 
decomposition [34], and clustering [62, 26], eventually 
translate into a few scans of the entire dataset. 
Further, individual records can be processed imme- 
diately and in any order, matching three of the cri- 
teria of appropriate free bandwidth uses. 


Our freeblock mining example issues a single free- 
block read request for each scan. This freeblock re- 
quest asks for the entire contents of the database in 
page-sized chunks. The freeblock scheduler ensures 
that only blocks of the specified size are provided 
and that all the blocks requested are read exactly 
once. However, the order in which the blocks are 
read will be an artifact of the pattern of foreground 
OLTP requests. 
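
The consumer side of such a scan can be sketched as follows. The sketch models the exactly-once guarantee with a set of still-pending blocks; in the design above that bookkeeping belongs to the freeblock scheduler, and the names `run_scan`, `arrivals`, and `process` are hypothetical.

```python
def run_scan(num_blocks, arrivals, process):
    """Process each block exactly once, in whatever order it happens to arrive."""
    pending = set(range(num_blocks))
    for block in arrivals:
        if block in pending:
            pending.discard(block)
            process(block)
            if not pending:
                break  # the scan is complete
    return pending  # blocks still outstanding, if any


# Arrival order is dictated by foreground activity; repeats are ignored.
values = [5, 1, 4, 2]
seen = []
left = run_scan(4, [2, 0, 2, 3, 1, 0], seen.append)
print(seen, left, sum(values[b] for b in seen))  # [2, 0, 3, 1] set() 12
```

The aggregate (here a sum) is the same for any arrival order, which is the property that makes scans of this kind a good fit for free bandwidth.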


Interestingly, this same design is appropriate for 
some other storage activities. For example, RAID 
scrubbing consists of verifying that each disk sec- 
tor can be read successfully (i.e., that no sector has 
fallen victim to media corruption). Also, a phys- 
ical backup consists of reading all disk sectors so 
that they can be written to another device. The 
free bandwidth achieved for such scanning activities 
would match that shown for freeblock data mining 
in this section. 


6.2 Experimental Setup 


The experiments in Section 6.3 were conducted 
using the DiskSim simulator configured to model 
the Quantum Atlas 10K and a synthetic fore- 
ground workload based on approximations of ob- 
served OLTP workload characteristics. The syn- 
thetic workload models a closed system with per- 
task disk requests separated by think times of 30 
milliseconds. We vary the multiprogramming level 
(MPL), or number of tasks, of the OLTP workload 
to create increasing foreground load on the system. 
For example, a multiprogramming level of ten means 
that there are ten requests active in the system at 
any given point, either queued at the disk or waiting 
in think time. The OLTP requests are uniformly- 
distributed across the disk’s capacity with a read to 
write ratio of 2:1 and a request size that is a multiple 
of 4 kilobytes chosen from an exponential distribu- 
tion with a mean of 8 kilobytes. Validation experi- 
ments (in [45]) show that this workload is sufficiently 
similar to disk traces of Microsoft’s SQL server run- 
ning TPC-C for the overall freeblock-related insights 
to apply to more realistic OLTP environments. The 
background data mining workload uses free band- 
width to make full scans of the disk’s contents in 
4 KB blocks, completing one scan before starting 
the next. All simulations run for the time required 
for the background data mining workload to com- 
plete ten full disk scans, and the results presented 
are averages across these ten scans. The experiments 
ignore bus bandwidth and record processing over- 
heads, assuming that media scan times dominate; 
this assumption might be appropriate if the mining 
data is delivered over distinct buses to dedicated processors, 
either on a small mining system or in Active 
Disks. 
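
A request generator matching the foreground workload parameters above might look like the following sketch. Two points are assumptions: the text does not say how an exponential draw is turned into a multiple of 4 KB, so the sketch rounds up (which raises the mean somewhat above 8 KB), and the names `make_request` and `disk_bytes` are hypothetical. The sketch is not the DiskSim workload generator.

```python
import math
import random

BLOCK = 4 * 1024       # request sizes are multiples of 4 KB
MEAN_SIZE = 8 * 1024   # mean of the exponential distribution
THINK_TIME_MS = 30.0   # per-task think time between requests


def make_request(rng, disk_bytes):
    """Return (op, offset, size) for one synthetic OLTP request."""
    raw = rng.expovariate(1.0 / MEAN_SIZE)
    # Assumed interpretation: round the draw up to a whole number of 4 KB blocks.
    size = max(1, math.ceil(raw / BLOCK)) * BLOCK
    # Uniformly distributed, block-aligned start that keeps the request on the disk.
    offset = rng.randint(0, (disk_bytes - size) // BLOCK) * BLOCK
    # Read-to-write ratio of 2:1.
    op = "read" if rng.random() < 2.0 / 3.0 else "write"
    return op, offset, size


rng = random.Random(0)
requests = [make_request(rng, 9 * 10**9) for _ in range(5)]
```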


6.3 Results 


Figure 8: Average freeblock-based data mining performance. (a) shows the disk head usage breakdown for the 
foreground OLTP workload at various MPLs. (b) shows the overall free bandwidth delivered to the data mining application 
for the same points. (c) shows the disk head usage breakdown with both the foreground OLTP workload and the background 
data mining application. 


Figure 8 shows the disk head usage for the foreground 
OLTP workload at a range of MPLs and the 
free bandwidth achieved by the data mining task. 
Low OLTP loads result in low data mining throughput, 
because little potential free bandwidth exists 
when there are few foreground requests. Instead, 
there is a significant amount of idle disk head time 
that could be used for freeblock requests, albeit not 
without some effect on foreground response times. 
Our study here focuses strictly on use of free band- 
width. As the foreground load increases, opportu- 
nities to service freeblock requests are more plenti- 
ful, increasing data mining throughput to about 4.9 
MB/s (21% of the Atlas 10K’s 23MB/s full potential 
bandwidth). This represents a 7x increase in useful 
disk head utilization, from 3% to 24%, and it allows 
the data mining application to complete over 47 full 
“scans per day” [24] of this 9GB disk with no effect 
on foreground OLTP performance. 
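
The bandwidth fraction and scan rate quoted above follow from simple arithmetic, reproduced here as a check. The sketch assumes decimal units (1 GB = 1000 MB), which the text does not state.

```python
SECONDS_PER_DAY = 24 * 60 * 60

mining_mb_per_s = 4.9   # achieved free bandwidth
full_mb_per_s = 23.0    # Atlas 10K full potential bandwidth
disk_mb = 9 * 1000      # 9 GB disk, decimal units assumed

fraction_of_full = mining_mb_per_s / full_mb_per_s
scans_per_day = mining_mb_per_s * SECONDS_PER_DAY / disk_mb

print(f"{fraction_of_full:.1%} of full bandwidth, {scans_per_day:.1f} scans per day")
```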


However, as shown in Figure 8b, freeblock scheduling 
realizes only half of the potential free bandwidth for 
this environment. As shown in Figure 8c, 18% of the 
remaining potential is lost to extra seek time, which 
occurs when pending freeblock requests only exist on 
a third track (other than the previous and current 
foreground request). The remaining 28% continues 
to be rotational latency, either as part of freeblock 
requests or because no freeblock request could be 
serviced within the available slot. 


Figure 9 helps to explain why only half of the po- 
tential free bandwidth is realized for data mining. 
Specifically, it shows data mining progress and per- 
OLTP-request breakdown as functions of the time 
spent on a given disk scan. The main insight here 
is that the efficiency of freeblock scheduling (i.e., 
achieved free bandwidth divided by potential free 
bandwidth) drops steadily as the set of still-desired 
background blocks shrinks. As the freeblock scheduler 
has more difficulty finding conveniently-located 
freeblock requests, it must look further and further 
from the previous and current foreground requests. 
As shown in Figure 9c, this causes extra seek times 
to increase. Unused rotational latency also increases 
as freeblock requests begin to incur some latency and 
as increasing numbers of foreground rotational laten- 
cies are found to be too small to allow any pending 
freeblock request to be serviced. As a result, ser- 
vicing the last few freeblock requests of a full scan 
takes a long time; for example, the last 5% of the 
freeblock requests take 30% of the total time for a 
scan. 


One solution to this problem would be to increase 
the priority of the last few freeblock requests, with a 
corresponding impact on foreground requests. The 
challenge would be to find an appropriate trade- 
off between impact on the foreground and improved 
background performance. 


An alternate solution would be to take advan- 
tage of the statistical nature of many data min- 
ing queries. Statistical sampling has been shown 
to provide accurate results for many queries 
and internal database operations after accessing 
only a (randomly-selected) subset of the total 
dataset [43, 13]. Figure 10 shows the impact of such 
statistical data mining as a function of the percent- 
age of the dataset needed; that is, the freeblock re- 
quest is aborted when enough of the dataset has 
been mined. Assuming that freeblock scheduling 
within the foreground OLTP workload results in sufficiently 
random data selection or that the sampling 
algorithm is adaptive to sampling biases [13], sampling 
can significantly increase freeblock scheduler 
efficiency. When any 95% of the dataset is sufficient, 
efficiency is 40% higher than for full disk scans. For 
80% of the dataset, efficiency is at 90% and data 
mining queries can complete over 90 samples of the 
dataset per day. 


Figure 9: Freeblock-based data mining progress for MPL 7. (a) shows the potential and achieved scan progress. (b) 
shows the corresponding instantaneous free bandwidth. (c) shows the usage of potential free bandwidth (i.e., original OLTP 
rotational latency), partitioning it into free transfer time, extra seek time, and unused latency. As expected, the shape of the 
Free Transfer line in (c) matches that of the achieved instantaneous mining bandwidth in (b). Both exceed the potential free 
bandwidth early in the scan because many foreground OLTP transfers can also be used by the freeblock scheduler for mining 
requests when most blocks are still needed. 


7 Related Work 


System designers have long struggled with disk per- 
formance, developing many approaches to reduce 
mechanical positioning overheads and to amortize 
these overheads over large media transfers. When 
effective, all of these approaches increase disk head 
utilization for foreground workloads and thereby re- 
duce the need for and benefits of freeblock schedul- 
ing; none have yet eliminated disk performance as 
a problem. The remainder of this section discusses 
work specifically related to extraction and use of free 
bandwidth. 


The characteristics of background workloads that 
can most easily utilize free bandwidth are much like 
those that can be expressed well with dynamic set 
[55] and disk-directed I/O [35] interfaces. Specif- 
ically, these interfaces were devised to allow ap- 
plication writers to expose order-independent ac- 
cess patterns to storage systems. Application-hinted 
prefetching interfaces [9, 44] share some of these 


same qualities. Such interfaces may also be appro- 
priate for specifying background activities to free- 
block schedulers. 


Use of idle time to handle background activities is a 
long-standing practice in computer systems. A sub- 
set of the many examples, together with a taxon- 
omy of idle time detection algorithms, can be found 
in [23]. Freeblock scheduling complements exploita- 
tion of idle time. It also enjoys two superior qual- 
ities: (1) ability to make forward progress during 
busy periods and (2) ability to make progress with 
no impact on foreground disk access times. Start- 
ing a disk request during idle time can increase the 
response time of subsequent foreground requests, by 
making them wait or by moving the disk head. 


In their exploration of write caching policies, Biswas, 
et al., evaluate a free prewriting mechanism called 
piggybacking [6]. Although piggybacking only con- 
siders blocks on the destination track or cylinder, 
they found that most write-backs could be com- 
pleted for free across a range of workloads and cache 
sizes. Relative to their work, our work generalizes 
both the freeblock scheduling algorithm and the uses 
for free bandwidth. 


Freeblock scheduling relies heavily on the ability 
to accurately predict mechanical positioning delays 
(both seek times and rotational latencies). The 
firmware of most high-end disk drives now supports 
Shortest-Positioning-Time-First (SPTF) scheduling 
algorithms, which require similar predictions. Based 
on this fact, we are confident that freeblock scheduling 
is feasible. However, it remains to be seen 
whether freeblock scheduling can be effective outside 
of disk drive firmware, where complete knowledge of 
current state and internal algorithms is available. 


Figure 10: Freeblock-based data mining performance for statistical queries. 
[...] the disk’s data satisfy the needs of a query scan. Below 60%, achieved free bandwidth exceeds potential free bandwidth because 
of the ability to satisfy freeblock requests from foreground transfers. 


Freeblock scheduling resembles advanced disk sched- 
ulers for environments with a mixed workload of 
real-time and non-real-time activities. While early 
real-time disk schedulers gave strict priority to real- 
time requests, more recent schedulers try to use 
slack in deadlines to service non-real-time requests 
without causing the deadlines to be missed [53, 41, 
5, 8]. Freeblock scheduling relates to conventional 
priority-based disk scheduling (e.g., [10, 21]) roughly 
as modern real-time schedulers relate to their prede- 
cessors. However, since non-real-time requests have 
no notion of deadline slack, freeblock scheduling 
must be able to service background requests with- 
out extending the access latencies of foreground re- 
quests at all. Previous disk scheduler architectures 
would not do this well for non-periodic foreground 
workloads, such as those explored in this paper. 


While freeblock scheduling can provide free media 
bandwidth, use of such bandwidth also requires some 
CPU, memory, and bus resources. One approach 
to addressing these needs is to augment disk drives 
with extra resources and extend disk firmware with 
application-specific functionality [1, 32, 46]. Poten- 
tially, such resources could turn free bandwidth into 
free functionality; Riedel, et al., [45] argue exactly 
this case for the data mining example of Section 6. 


Another interesting use of accurate access time predictions 
and layout information is eager writing, or 
remapping new versions of disk blocks to free loca- 
tions very near the disk head [27, 18, 40, 57]. We 
believe that eager writing and freeblock scheduling 
are strongly complementary concepts. Although ea- 
ger writing decreases available free bandwidth dur- 
ing writes by eliminating many seek and rotational 
delays, it does not do so for reads. Further, eager 
writing could be combined with freeblock scheduling 
when using a write-back cache. Finally, as with the 
LFS cleaning example in Section 5, free bandwidth 
represents an excellent resource for cleaning and re- 
organization enhancements to the base eager writing 
approach [57]. 


8 Conclusions 


This paper describes freeblock scheduling, quantifies 
its potential under various conditions, and demon- 
strates its value for two specific application environ- 
ments. By servicing background requests in the con- 
text of mechanical positioning for normal foreground 
requests, 20-50% of a disk’s potential media band- 
width can be obtained with no impact on the orig- 
inal requests. Using simulation, this paper shows 
that this free bandwidth can be used to clean LFS 
segments on busy file servers and to mine data on 
active transaction processing systems. 


These results indicate significant promise, but ad- 
ditional experience is needed to refine and realize 
freeblock scheduling in practice. For example, it re- 
mains to be seen whether freeblock scheduling can 
be implemented outside of modern disk drives, given 
their high-level interfaces and complex firmware algorithms. 
Even inside disk firmware, freeblock 
scheduling will need to conservatively deal with seek 
and settle time variability, which may reduce its ef- 
fectiveness. More advanced freeblock scheduling al- 
gorithms will also be needed to deal with request 
fragmentation, starvation, and priority mixes. 


Abstract 


Storage Latency Estimation Descriptors, or SLEDs, are 
an API that allows applications to understand and take 
advantage of the dynamic state of a storage system. By 
accessing data in the file system cache or high-speed 
storage first, total I/O workloads can be reduced and 
performance improved. SLEDs report estimated data 
latency, allowing users, system utilities, and scripts to 
make file access decisions based on those retrieval time 
estimates. SLEDs thus can be used to improve individ- 
ual application performance, reduce system workloads, 
and improve the user experience with more predictable 
behavior. 


We have modified the Linux 2.2 kernel to support 
SLEDs, and several Unix utilities and astronomical ap- 
plications have been modified to use them. As a result, 
execution times of the Unix utilities when data file sizes 
exceed the size of the file system buffer cache have been 
reduced by 50% to more than an order of magnitude. 
The astronomical applications incurred 30-50% 
fewer page faults and reductions in execution time of 
10-35%. Performance of applications that use SLEDs 
also degrades more gracefully as data file size grows. 


1 Introduction 


Storage Latency Estimation Descriptors, or SLEDs, ab- 
stract the basic characteristics of data retrieval in a 
device-independent fashion. The ultimate goal is to cre- 
ate a mechanism that reports detailed performance char- 
acteristics without being tied to a particular technology. 


* Author’s current address: Nokia, Santa Cruz, CA. 


Storage systems consist of multiple devices with differ- 
ent performance characteristics, such as RAM (e.g., the 
operating system’s file system buffer cache), hard disks, 
CD-ROMs, and magnetic tapes. These devices may be 
attached to the machine on which the application is run- 
ning, or may be attached to a separate server machine. 
All of these elements communicate via a variety of inter- 
connects, including SCSI buses and ethernets. As sys- 
tems and applications create and access data, it moves 
among the various devices along these interconnects. 


Hierarchical storage management (HSM) systems with 
capacities up to a petabyte currently exist, and systems 
up to 100PB are currently being designed [LLJR99, 
Shi98]. In such large systems, tape will continue to play 
an important role. Data is migrated to tape for long- 
term storage and fetched to disk as needed, analogous to 
movement between disk and RAM in conventional file 
systems. A CD jukebox or tape library automatically 
mounts media to retrieve requested data. 


Storage systems have a significant amount of dynamic 
state, a result of the history of accesses to the system. 
Disks have head and rotational positions, tape drives 
have seek positions, autochangers have physical posi- 
tions as well as a set of tapes mounted on various drives. 
File systems are often tuned to give cache priority to re- 
cently used data, as a heuristic for improving future ac- 
cesses. As a result of this dynamic state, the latency 
and bandwidth of access to data can vary dramatically; 
in disk-based file systems, by four orders of magnitude 
(from microseconds for cached, unmapped data pages, 
to tens of milliseconds for data retrieved from disk), in 
HSM systems, by as much as eleven (microseconds up 
to hundreds of seconds for tape mount and seek). 


File system interfaces are generally built to hide this 
variability in latency. A read() system call works the 
same for data to be read from the file system buffer cache 
as for data to be read from disk. Only the behavior is 
different; the semantics are the same, but in the first case 
the data is obtained in microseconds, and in the second, 
in tens of milliseconds. 


Figure 1: SLEDs and hints in the storage system stack 


CPU performance is improving faster than storage device 
performance. It therefore becomes attractive to expend 
CPU instructions to make more intelligent decisions 
concerning I/O. However, with the strong abstraction 
of file system interfaces, applications are limited in 
their ability to contribute to I/O decisions; only the sys- 
tem has the information necessary to schedule I/Os. 


SLEDs are an API that allows applications and libraries 
to understand both the dynamic state of the storage sys- 
tem and some elements of the physical characteristics of 
the devices involved, in a device-independent fashion. 
Using SLEDs, applications can manage their patterns of 
I/O calls appropriately. They may reorder or choose not 
to execute some I/O operations. They may also report 
predicted performance to users or other applications. 
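
As an illustration of reordering I/O by estimated latency, the sketch below counts lines in a file by reading its lowest-latency regions first. The descriptor layout `(offset, length, latency_s)` and the function name are hypothetical and stand in for whatever the SLEDs interface actually reports; the latency values in the usage example are made up.

```python
import io


def count_lines_reordered(f, descriptors):
    """Count newlines, visiting regions in order of increasing estimated latency."""
    total = 0
    for offset, length, _latency_s in sorted(descriptors, key=lambda d: d[2]):
        f.seek(offset)
        total += f.read(length).count(b"\n")
    return total


data = b"a\nbb\nccc\ndddd\n"
f = io.BytesIO(data)
# Hypothetical estimates: the second region is cached, the first is on disk.
descriptors = [(0, 5, 0.010), (5, 9, 0.000001)]
print(count_lines_reordered(f, descriptors))  # 4
```

A line count is additive across regions, so the result does not depend on the order in which they are read. That order independence is what lets an application consume cached data before it can be evicted.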


SLEDs can be contrasted with file system hints, as shown 
in Figure 1. Hints are the flow of information down the 
storage system stack, while SLEDs are the flow of information 
up the stack. The figure is drawn with the storage 
devices as well as the storage system software partic- 
ipating. In current implementations of these concepts, 
the storage devices are purely passive, although their 
characteristics are measured and presented by proxy for 
SLEDs. 


This paper presents the first implementation and measurement 
of the concept of SLEDs, which we proposed 
in an earlier paper [Van98]. We have implemented the 
SLEDs system in kernel and library code under Linux 
(Red Hat 6.0 and 6.1 with 2.2 kernels), and modified 
several applications to use SLEDs. 


The applications we have modified demonstrate the dif- 
ferent uses of SLEDs. wc and grep were adapted 
to reorder their I/O calls based on SLEDs information. 
The performance of wc and grep has been improved 
by 50% or more over a broad range of file sizes, and 
more than an order of magnitude under some conditions. 
find is capable of intelligently choosing not to perform 
certain I/Os. The GUI file manager gmc reports esti- 
mated retrieval times, improving the quality of informa- 
tion users have about the system. 


We also modified LHEASOFT, a large, complex suite 
of applications used by professional astronomers for image 
processing [NAS00]. One member of the suite, 
fimhisto, which copies the data file and appends a 
histogram of the data to the file, showed a reduction in 
page faults of 30-50% and a 15-25% reduction in execu- 
tion time for files larger than the file system buffer cache. 
fimgbin, which rebins an image, showed a reduction 
of 11-35% in execution time for various parameters. The 
smaller improvements are due in part to the complexity 
of the applications, relative to wc and grep. 


The next section presents related work. This is followed 
by the SLEDs design, then details of the implementa- 
tion, and results. The paper ends with future work and 
conclusions. 


2 Related Work 


The history of computer systems has generally pushed 
toward increasingly abstract interfaces hiding more of 
the state details from applications. In a few instances, 
HSM systems provide some ability for HSM-aware ap- 
plications to determine if files are online (on disk) or 
offline (on tape or other low levels of the hierarchy). Microsoft’s
Windows 2000 (formerly NT5) [vI99], TOPS-20,
and the RASH system [HP89] all provide or provided
a single bit that indicates whether the file is online
or offline. SLEDs extends this basic concept by provid- 
ing more detailed information. 


Steere’s file sets [Ste97] exploit the file system cache on 
a file granularity, ordering access to a group of files to
present the cached files first. However, there is no notion
of intra-file access ordering. 


Some systems provide more direct control over what 
pages are selected for prefetching or for cache replace- 
ment. Examples include TOPS-10 [Dig76] and TOPS- 
20 and Mach’s user-definable pagers, Cao’s application- 
controlled file caching [CFKL96], and the High Perfor- 
mance Storage System (HPSS) [WC95]. 


Still other systems have improved system performance 
by a mechanism known as hints. Hints are a flow of information
from the application to the system about expected
access orders and data reuse. They are, in effect,
the inverse of SLEDs, in which information flows from 
the system to the application. Hints may allow the sys- 
tem to behave more efficiently, but do not allow the ap- 
plication to participate directly in I/O decisions, and can- 
not report expected I/O completion times to the appli- 
cation or user. Good improvements have been reported 
with hints over a broad range of applications [PGG+95].
Reductions in elapsed time of 20 to 35 percent for a 
single-disk system were demonstrated, and as much as 
75 percent for ten-disk systems. Hints cannot be used 
across program invocations, or take advantage of state 
left behind by previous applications. However, hints 
can help the system make more intelligent choices about 
what data should be kept in cache as an application runs. 


Hillyer and Silberschatz developed a detailed device 
model for a DLT tape drive that allows applications 
to schedule I/Os effectively [HS96a, HS96b]. Sandstå
and Midtstraum extended their model, simplifying it and
making it easier to use [SM99]. The goal is the same 
as SLEDs, effective application-level access ordering, 
but is achieved in a technology-aware manner. Such 
algorithms are good candidates to be incorporated into 
SLEDs libraries, hiding the details of the tape drive from 
application writers. 


For disk drives, detailed models such as Ruemm- 
ler’s [RW94] and scheduling work such as Worthing- 
ton’s [WGP94] may be used to enhance the accuracy of 
SLEDs. 


Real-time file systems for multimedia, such as Ander- 
son’s continuous media file system [AOG92] and SGI’s 
XFS [SDH+96], take reservation requests and respond
with an acceptance or rejection. SLEDs could be inte- 
grated with such systems to provide substrate (storage 
and transfer subsystems communicating their character- 
istics to the file system to improve its decision-making 
capabilities), or to increase the usefulness of the in- 
formation provided to applications about their requests. 


Such systems calculate information similar to SLEDs in- 
ternally, but currently do not expose it to applications, 
where it could be useful. 


Distributed storage systems, such as Andrew and 
Coda [Sat90], Tiger [BFD97], Petal [LT96], and 
xFS [ADN+95], present especially challenging problems
in representing performance data, as many factors,
including network characteristics and local caching, 
come into play. We propose that SLEDs be the vocab- 
ulary of communication between clients and servers as 
well as between applications and operating systems. 


Mobile systems, including PDAs and cellular phones,
are an area where optimizing for latency and bandwidth
is especially important [FGBA96]. Perhaps
the work most like SLEDs is Odyssey [NSN+97]. Applications
make resource reservations with SLEDs-like
requests, and register callbacks which can return a new 
value for a parameter such as network latency or band- 
width as system conditions change. 


Attempts to improve file system performance through 
more intelligent caching and prefetching choices include 
Kroeger’s [Kro00] and Griffioen’s [GA95]. Both use file 
access histories to predict future access patterns so that 
the kernel can prefetch more effectively. Kroeger reports 
I/O wait time reductions of 33 to 90 percent under var- 
ious conditions on his implementation, also done on a 
Linux 2.2.12 kernel.


Kotz has simulated and studied a mechanism called disk- 
directed I/O for parallel file systems [Kot94]. Compute 
processors (CPs) do not adjust their behavior depending 
on the state of the I/O system, but collectively aggregate 
requests to I/O processors (IOPs). This allows the I/O 
processors to work with deep request queues, sorting for 
efficient access to achieve a high percentage of the disk 
bandwidth. Unlike SLEDs, the total I/O load is not re- 
duced by taking dynamic advantage of the state of client 
(CP) caches, though servers (IOPs) may gain a similar 
advantage by ordering already-cached data to be deliv- 
ered first. 


Asynchronous I/O, such as that provided by POSIX I/O
or VMS, can also reduce application wait times by over- 
lapping execution with I/O. In theory, posting asyn- 
chronous read requests for the entire file, and process- 
ing them as they arrive, would allow behavior simi- 
lar to SLEDs. This would need to be coupled with 
a system-assigned buffer address scheme such as con- 
tainers [PA94], since allocating enough buffers for files 
larger than memory would result in significant virtual 
memory thrashing.


struct sled {
    long  offset;     /* into the file */
    long  length;     /* of the segment */
    float latency;    /* in seconds */
    float bandwidth;  /* in bytes/sec */
};

Figure 2: SLED structure 


3 SLEDs Design 


The basic elements of the Storage Latency Estimation 
Descriptor structure are shown in Figure 2. SLEDs rep- 
resent the estimated latency to retrieve specific data ele- 
ments of a file. A SLED consists of the byte offset within 
the file, the length in bytes of this section, and the per- 
formance estimates. The estimates are the latency to the 
first byte of the section and the bandwidth at which data 
will arrive once it has begun. Floating point numbers are 
used to represent the latency because the necessary range 
exceeds that of a normal integer. We chose floating point 
numbers for bandwidth for consistency of representation 
and ease of arithmetic. 


Different sections (usually blocks) of a file may have different
retrieval characteristics, and so will be represented
by separate SLEDs. For large files, as a file is used and 
reused, the state of a file may ultimately be represented 
by a hundred or more SLEDs. Moving from the begin- 
ning of the file to the end, each discontinuity in storage 
media, latency, or bandwidth results in another SLED in 
the representation. 


Applications take advantage of the information in a 
SLED in one of three possible ways: reordering I/Os, 
pruning I/Os, or reporting latencies. Each is detailed in 
the following subsections. 


3.1 Reordering I/Os 


By accessing the data currently in primary memory first, 
and items that must be retrieved from secondary or ter- 
tiary storage later, the number of physical I/Os that must 
be performed may be reduced. This may require algo- 
rithmic changes in applications. 


Figure 3 shows how two linear passes across a file behave
with normal LRU caching when the file is larger
than the cache size. A five-block file is accessed using
a cache which is only three blocks. The contents of the
cache are shown as file block numbers, with “e” being
empty. The second pass gains no advantage from the
data cached as a result of the first pass, as the tail of
the file is progressively thrown out of cache during the
current pass. The two passes may be within a single application,
or two separate applications run consecutively.
Behavior is similar whether the two levels are memory
and disk, as in a normal file system, or disk and tape, as
in any hierarchical storage management system which
caches partial files from tape to disk.


Figure 3: Movement of data among storage levels during
two linear passes


By using SLEDs, the second pass would be able to recognize
that reading the tail of the file first is advantageous.
In this case, blocks 3, 4, and 5 would be known
to be cached, and read before blocks 1 and 2. The total
number of blocks retrieved from the lower storage level 
in this second pass would only be two instead of five. 
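

The block counts in this example can be reproduced with a small simulation. The following is a minimal sketch, assuming a strict LRU replacement policy and the five-block file and three-block cache of Figure 3; the names lru_access, run_pass, and second_pass_misses are illustrative and are not part of the SLEDs implementation.

```c
#include <string.h>

#define CACHE_BLOCKS 3   /* cache capacity from the Figure 3 example */

/* A tiny LRU cache of block numbers; slot 0 is least recently used. */
struct lru {
    int blocks[CACHE_BLOCKS];
    int used;
};

/* Touch a block; returns 1 on a miss (fetch from the lower level), 0 on a hit. */
static int lru_access(struct lru *c, int block)
{
    int i, miss = 1;

    for (i = 0; i < c->used; i++) {
        if (c->blocks[i] == block) {
            miss = 0;
            break;
        }
    }
    if (miss && c->used < CACHE_BLOCKS)
        i = c->used++;            /* fill an empty slot */
    else if (miss)
        i = 0;                    /* evict the least recently used block */

    /* Slide newer entries down and place this block at the MRU end. */
    memmove(&c->blocks[i], &c->blocks[i + 1],
            (size_t)(c->used - 1 - i) * sizeof(int));
    c->blocks[c->used - 1] = block;
    return miss;
}

/* Run a sequence of block reads and count the misses. */
static int run_pass(struct lru *c, const int *order, int n)
{
    int i, misses = 0;

    for (i = 0; i < n; i++)
        misses += lru_access(c, order[i]);
    return misses;
}

/* Misses in the second pass after a linear first pass over blocks 1..5. */
static int second_pass_misses(const int *second_order)
{
    static const int linear[5] = {1, 2, 3, 4, 5};
    struct lru cache = { {0}, 0 };

    run_pass(&cache, linear, 5);              /* first pass warms the cache */
    return run_pass(&cache, second_order, 5);
}
```

Calling second_pass_misses with the order 1, 2, 3, 4, 5 counts five fetches from the lower level, while the cached-first order 3, 4, 5, 1, 2 counts two.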


3.2 Pruning I/Os 


By algorithmically choosing not to execute particular 
I/Os, an application may improve its performance by or- 
ders of magnitude, as well as be a better citizen by re- 
ducing system load. 


A simple example that combines both pruning and re- 
ordering is an application which is looking for a specific 
record in a large file or set of files. If the desired record’s 
position is toward the end of the data set as normally 
ordered, but already resides in memory rather than on 
disk or tape (possibly as a result of recent creation or 
recent access), it may be among the first accessed. As 
a result, the application may terminate without requesting
data which must be retrieved from disk or tape, and
performance may improve by an order of magnitude or
more.


It may also simply be desirable to run an application with
the minimum number of I/O operations, even at the cost 
of reduced functionality. This may be applied in, for 
example, environments that charge per I/O operation, as 
used to be common for timesharing systems. 


3.3 Reporting Latency 


Applications in several categories depend on or can be 
improved by an ability to predict their performance. 
Quality of service with some real-time component is the 
most obvious but not the only such category; any I/O- 
intensive application on which a user depends for inter- 
active response is a good candidate to use SLEDs. 


Systems that provide quality of service guarantees gen- 
erally do so with a reservation mechanism, in which ap- 
plications request a specific performance level, and the 
system responds with a simple yes or no about its ability 
to provide the requested performance. Once the reser- 
vation has begun, QoS systems rarely provide any addi- 
tional information about the arrival of impending data. 


Most applications which users interact with directly are 
occasionally forced to retrieve significant amounts of 
data, resulting in the appearance of icons informing 
the user that she must wait, but with no indication of 
the expected duration. Better systems (including web 
browsers) provide visible progress indicators. Those in- 
dicators are generally estimated based on partial retrieval 
of the data, and so reflect current system conditions, but 
cannot be calculated until the data transfer has begun. 
Dynamically calculated estimates can be heavily skewed 
by high initial latency, such as in an HSM system. Using 
SLEDs instead provides a clearer picture of the relation- 
ship of the latency and bandwidth, providing comple- 
mentary data to the dynamic estimate, and can be pro- 
vided before the retrieval operation is initiated. 


Both types of applications above have a common need 
for a mechanism to communicate predicted latency of 
I/O operations from storage devices to operating systems 
to libraries to applications. SLEDs is one proposal for 
the vocabulary of such communication. 


3.4 Design Limitations 


SLEDs, as currently implemented, describe the state of 
the storage system at a particular instant. This state, 
however, varies over time. Mechanical positioning of
devices changes, and cached data can change as a result
of I/Os performed by the application, other applications 
or system services, or even other clients in a distributed 
system. Adding a lock or reservation mechanism would 
improve the accuracy and lifetime of SLEDs by control- 
ling access to the affected resources. 


Another possibility is to include in the SLEDs them- 
selves some description of how the system state will 
change over time, such as a program segment that appli- 
cations could use to predict which pages of a file would 
be flushed from cache based on current page replace- 
ment algorithms. 


4 Implementation 


The implementation of SLEDs includes some kernel 
code to assess and report the state of data for an open 
file descriptor, an ioctl call for communicating that
information to the application level, and a library that 
applications can use to simplify the job of ordering I/O 
requests based on that information. 


This section describes the internal details of the imple- 
mentation of the kernel code, library and application 
modifications. 


4.1 Kernel 


We modified the Linux kernel to determine which device 
the pages for a file reside on, and whether or not the 
pages are currently in memory. All of the changes were 
made in the virtual file system (VFS) layer, independent 
of the on-disk data structure of ext2 or ISO9660. 


A sleds_table, kept in the kernel, is filled by call- 
ing a script from /etc/rc.d/init.d every time 
the machine is booted. The sleds_table has a la- 
tency and bandwidth entry for every storage device, 
as well as NFS-mounted partitions and primary mem- 
ory. The latency and bandwidth for both local and 
network file systems are obtained by running the lmbench
benchmark [MS96]. The current implementation
keeps only a single entry per device; for better accuracy,
entries which account for the different bandwidths
of different disk zones will be added in a future ver- 
sion [Van97]. The script fills the kernel table via a new 
ioctl call, FSLEDS_FILL, added to the generic file 
system ioctl.


Applications can retrieve the SLEDs for a file using another
new ioctl call, FSLEDS_GET, which returns a
vector of SLEDs. To build the vector of SLEDs and 
their latency and bandwidth, each virtual memory page 
of the data file is checked. After the kernel finds out 
where a page of the data file resides, it assigns a latency 
and bandwidth from the sleds_table to this page. If
consecutive pages have the same latency and bandwidth, 
i.e., they are in the same storage device, they are grouped 
into one SLED. During this process, the length and off- 
set of the SLEDs are also assigned. 
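

The grouping step described above can be sketched as follows. This is only an illustration, not the kernel code: the 4096-byte page size, the struct level table row, the page_level input array, and the function name build_sleds are all assumptions made for the example.

```c
#include <stddef.h>

#define PAGE_SIZE 4096L   /* assumed page size */

struct sled {
    long  offset;     /* into the file */
    long  length;     /* of the segment */
    float latency;    /* in seconds */
    float bandwidth;  /* in bytes/sec */
};

/* One row of a hypothetical per-device table (cf. sleds_table). */
struct level {
    float latency;
    float bandwidth;
};

/*
 * page_level[i] is the index into table[] of the storage level that
 * currently holds page i. Writes at most max_out SLEDs to out[] and
 * returns the number written.
 */
static size_t build_sleds(const int *page_level, size_t npages,
                          const struct level *table,
                          struct sled *out, size_t max_out)
{
    size_t i, n = 0;

    for (i = 0; i < npages; i++) {
        const struct level *lv = &table[page_level[i]];

        if (n > 0 && page_level[i] == page_level[i - 1]) {
            out[n - 1].length += PAGE_SIZE;   /* extend the current SLED */
            continue;
        }
        if (n == max_out)
            break;                            /* output vector is full */
        out[n].offset    = (long)i * PAGE_SIZE;
        out[n].length    = PAGE_SIZE;
        out[n].latency   = lv->latency;
        out[n].bandwidth = lv->bandwidth;
        n++;
    }
    return n;
}
```

With a two-entry table (memory and hard disk), a six-page file whose pages reside on disk, disk, memory, memory, memory, disk yields three SLEDs of two, three, and one page.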


4.2 Library and API 


The means of communication between application space 
and kernel space is via ioctl calls which return only
SLEDs. This form is not directly very useful, so a library 
was also written that provides additional services. The 
library provides a routine to estimate delivery time for 
the entire file, and several routines to help applications 
order their I/O requests efficiently. 


The three primary library routines for reordering I/O 
are sleds_pick_init, sleds_pick_next_read,
and sleds_pick_finish. Applications first open
the file, then call sleds_pick_init, which uses
ioctl to retrieve the SLEDs from the kernel.
sleds_pick_next_read is called repeatedly to advise
the application where to read from the file next.
The application then moves to the recommended position
via lseek, and calls read to retrieve the next
chunk of the file. The preferred size of the chunks the
application wants to read is specified as an argument to
sleds_pick_init, and sleds_pick_next_read
will return chunks that size or smaller. The library
presumes that the application follows its advice, but
does not check. The library will return each chunk of
the file exactly once.


The library checks for the lowest latency among unseen 
chunks, then chooses to return the chunk with the lowest 
file offset among those with equivalent latencies. In the 
simple case of a disk-based file system with a cold cache, 
this algorithm will degenerate to linear access of the file. 
As currently implemented, the SLEDs are retrieved from 
the kernel only when sleds_pick_init is called. Refreshing
the state of those SLEDs occasionally would allow
the library to take advantage of any changes in state
caused by, e.g., file prefetching.
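

The selection rule (lowest latency first, lowest file offset among equal latencies) can be sketched as a linear scan. The struct chunk layout and the function name pick_next are hypothetical; the library's internal data structures are not described here.

```c
#include <stddef.h>

struct chunk {
    long  offset;    /* byte offset of the chunk in the file */
    float latency;   /* estimated latency, in seconds */
    int   seen;      /* nonzero once the chunk has been returned */
};

/*
 * Return the index of the next chunk to read, or -1 when every chunk
 * has been returned. Marks the chosen chunk as seen so that each
 * chunk is handed out exactly once.
 */
static int pick_next(struct chunk *c, size_t n)
{
    int best = -1;
    size_t i;

    for (i = 0; i < n; i++) {
        if (c[i].seen)
            continue;
        if (best < 0 ||
            c[i].latency < c[best].latency ||
            (c[i].latency == c[best].latency &&
             c[i].offset < c[best].offset))
            best = (int)i;
    }
    if (best >= 0)
        c[best].seen = 1;
    return best;
}
```

When every chunk has the same latency, as with a cold cache, the scan returns chunks in ascending offset order, which is the linear access pattern noted above.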


Figure 4: Adjusting SLEDs for record boundaries


A library routine, sleds_total_delivery_time,
provides an estimate of the amount of time required
to read the entire file, for applications only interested
in reporting or using that value. It takes an argument,
attack_plan, which currently can be either
SLEDS_LINEAR or SLEDS_BEST, to describe the intended
access pattern.
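

One plausible way to turn a vector of SLEDs into a single time estimate is sketched below. The per-SLED cost (latency to the first byte plus length divided by bandwidth) follows from the SLED definition in Section 3; summing the SLEDs without overlap is an assumption of this sketch, the function names are illustrative, and the difference between the two attack plans is not modeled.

```c
#include <stddef.h>

struct sled {
    long  offset;     /* into the file */
    long  length;     /* of the segment */
    float latency;    /* in seconds */
    float bandwidth;  /* in bytes/sec */
};

/* Time to deliver one SLED: wait for the first byte, then stream the rest. */
static double sled_time(const struct sled *s)
{
    return (double)s->latency + (double)s->length / (double)s->bandwidth;
}

/* Assumed model: SLEDs are fetched one after another, with no overlap. */
static double estimate_total_time(const struct sled *v, size_t n)
{
    double total = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        total += sled_time(&v[i]);
    return total;
}
```

For example, a 9,000,000-byte SLED on a device with 18 msec latency and 9.0 MB/s bandwidth contributes about 1.018 seconds under this model.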


Generally, the library and application are most effi- 
cient if these accesses can be done on page boundaries. 
However, many applications are interested in variable- 
sized records, such as lines of text. An argument to 
sleds_pick_init allows the application to ask for 
record-oriented SLEDs, and to specify the character 
used to identify record boundaries. 


The library prevents applications from running over the 
edge of a low-latency SLED and causing data to be 
fetched from higher-latency storage when in record- 
oriented mode. It does this by pulling in the edges of the 
SLEDs from page boundaries to record boundaries, as 
shown in Figure 4. The leading and trailing record frag- 
ments are pushed out to the neighboring SLEDs, which 
are higher latency. This requires the SLEDs library to 
perform some I/O itself to find the record boundaries. 
In the figure, the gray areas are high-latency SLEDs in 
a file, and the white area is a low-latency SLED. The 
numbers above represent the access ordering assigned 
by the library. The upper line is before adjustment for 
record boundaries, and the lower line is after adjustment 
for variable-sized records with a linefeed record separa- 
tor. 
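

The edge adjustment can be illustrated on an in-memory buffer. This is a sketch under simplifying assumptions: the real library must perform I/O to locate separators, whereas here the whole file is assumed to be available in data[], and the function name trim_to_records is hypothetical.

```c
#include <stddef.h>

/*
 * Shrink the byte range [*start, *end) of data[] so that it begins and
 * ends on record boundaries, looking only at bytes inside the range.
 * A range that starts at offset 0 or ends at file_len is already
 * aligned on that side. Returns 0 if no complete record remains.
 */
static int trim_to_records(const char *data, size_t file_len, char sep,
                           size_t *start, size_t *end)
{
    size_t s = *start, e = *end;

    if (s > 0) {
        /* Skip the leading fragment, up to and including its separator. */
        while (s < e && data[s] != sep)
            s++;
        if (s < e)
            s++;
    }
    if (e < file_len) {
        /* Drop the trailing fragment after the last separator. */
        while (e > s && data[e - 1] != sep)
            e--;
    }
    *start = s;
    *end = e;
    return s < e;
}
```

The bytes trimmed from each side are the leading and trailing record fragments, which would be left to the neighboring higher-latency SLEDs.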


4.3 Applications


Figure 5 shows pseudocode for an application using the 
SLEDs pick library. After initializing the SLEDs library, 
the application loops, first asking the library to select a 
file offset, then seeking to that location and reading the 
amount recommended. 


function                     arguments                                  return value
sleds_pick_init              file descriptor, preferred buffer size     buffer size
sleds_pick_next_read         file descriptor, buffer size, record flag  read location, size
sleds_pick_finish            file descriptor                            (none)
sleds_total_delivery_time    file descriptor, attack plan               estimated delivery time

Table 1: SLEDs library routines


int offset, nbytes, Remain;
int FileSize, fd;
char buffer[BUFSIZE];

fd = open(FileName, flags);
sleds_pick_init(fd, BUFSIZE);
for (Remain = FileSize; Remain > 0; Remain -= nbytes) {
    /* ask the library where to read next, and how much */
    sleds_pick_next_read(fd, &offset, &nbytes);
    lseek(fd, offset, SEEK_SET);
    read(fd, buffer, nbytes);
    process_data(buffer, nbytes);
}
sleds_pick_finish(fd);
close(fd);

Figure 5: Application pseudocode


Applications that have been modified to use SLEDs include
the GNU versions of the Unix system utilities wc,
grep, and find, and the GNOME file manager gmc.
These examples demonstrate the three ways in which 
SLEDs can be useful. The first two use SLEDs to re- 
order I/O operations, gaining performance and reducing 
total I/O operations by taking advantage of cached data 
first, using algorithms similar to Figure 5. find is modified
to include a predicate which allows the user to select
files based on total estimated latency (either greater than 
or less than a specified value). This can be used to prune 
expensive I/O operations. gmc reports expected file re- 
trieval times to the user, allowing him or her to choose 
whether or not to access the file. 


We have also adapted two members of the LHEASOFT 
suite of applications, fimhisto and fimgbin, to use 
SLEDs. NASA’s Goddard Space Flight Center sup- 
ports LHEASOFT, which provides numerous utilities 
for the processing of images in the Flexible Image Trans- 
port System, or FITS, format used by professional as- 
tronomers. The FITS format includes image metadata, 
as well as the data itself. 


4.4 Implementation Limitations 


The current implementation provides only a basic esti- 
mate of latency based on device characteristics, with no 
indication of current head or rotational position. The 
primary information provided is a distinction between 
levels of the storage system, with estimates of the band- 
width and latency to retrieve data at each level. This in- 
formation is effective for disk drives, but will need to be 
updated for tape drives. Future extensions are expected 
to provide more detailed mechanical estimates. 


5 Results and Analysis 


The benefits of SLEDs include both useful predictability 
in I/O execution times, and improvements in execution 
times for those applications which can reorder their I/O 
requests. In this section we discuss primarily the latter, 
objectively measurable aspects. 


We hypothesize that reordering I/O requests according 
to SLEDs information will reduce the number of hard 
page faults, that this will translate directly to decreased 
execution times, and that the effort required to adopt 
SLEDs is reasonable. To validate these hypotheses, we 
measured both page faults and elapsed time for the mod- 
ified applications described above, and report the num- 
ber of lines of code changed. 


SLEDs are expected to benefit hierarchical storage man- 
agement systems, with their very high latencies, more 
than other types of file systems. The experiments shown 
here are for normal on-disk file system structures cached 
in primary memory. Thus, the results here can be viewed 
as indicative of the positive benefits of SLEDs, but gains 
may be much greater with HSM systems.


5.1 Experimental Setup


To measure the effect of reordering data requests, the 
average time and page faults taken to execute the ap- 
plications with SLEDs were plotted against the values 
without SLEDs. We used the system time command 
to do the measurements. Tests were run on wc and 
grep for data files that reside on hard disk, CD-ROM, 
and NFS file systems. grep was tested in two differ- 
ent modes, once doing a full pass across the data file, 
and once terminating on the first match (using the -q 
switch). The modified LHEASOFT applications were 
run only against hard disk file systems. 


During the test runs, no other user activity was present 
on the systems being measured. However, some vari- 
ability in execution times is unavoidable, due to the 
physical nature of I/O and the somewhat random nature 
of page replacement algorithms and background system 
activity. All runs were done twelve times (representing 
a couple of days’ execution time in total) and 90% con- 
fidence intervals calculated. The graphs show the mean 
and confidence intervals, though in some cases the con- 
fidence intervals are too small to see. 


Data was taken for test file sizes of 8 to 128 megabytes, 
in multiples of eight, for most of the experiments. With 
a primary memory size of 64 MB, this upper bound is 
twice the size of primary memory and roughly three 
times the size of the portion of memory available to 
cache file pages. We expect no surprises in the range 
above this value, but a gradual decrease in the relative 
improvement. 


Because SLEDs are intended to expose and take advan- 
tage of dynamic system state, all experiments were done 
with a warm file cache. A warm cache is the natural 
state of the system during use, since files that have been 
recently read or written are commonly accessed again 
within a short period of time. SLEDs provide no benefit 
in the case of a completely cold RAM cache for a disk- 
based file system. The first run to warm the cache was 
discarded from the result. The runs were done repeat- 
edly in the same mode, so that, for example, the second 
run of grep without SLEDs found the file system buffer 
cache in the state that the first run had left it. 


Tables 2 and 3 contain the device characteristics used 
during these experiments. 


Table 4 lists the number of lines of source code modified
in each application. The “src” columns are lines of code
in the main application source files. The “lib” columns are lines
of code in additional, shared, linked-in libraries. The
“modified” columns are lines of code added or modified,
and the “total” columns are the totals. The LHEASOFT
cfitsio library modifications are shared, used in both
fimhisto and fimgbin. The grep modifications
are most extensive because of the need to buffer and sort
output in a different fashion.


            latency     throughput
memory      175 nsec    48 MB/s
hard disk   18 msec     9.0 MB/s
CD-ROM      130 msec    2.8 MB/s
NFS         270 msec    1.0 MB/s

Table 2: Storage levels used for measuring Unix utilities


            latency     throughput
memory                  87 MB/s
hard disk               7.0 MB/s

Table 3: Storage levels used for measuring LHEASOFT
utilities


5.2 Unix Utilities


In gmc, a new simple panel is added to the file properties
dialog box, as shown in Figure 6. This follows
closely the implementation of other windows such as 
the “general” and “URL” properties panels. The SLEDs 
panel reports the length, offset, latency, and bandwidth 
of each SLED, as well as the estimated total delivery 
time for the file. Users can interactively use this panel to 
decide whether or not to access files; this is expected to 
be especially useful in HSM systems and low-bandwidth 
distributed systems. The same approach could be used 
with a web browser, if HTTP were extended to support 
SLEDs across the network. 


application   src lines of code   lib lines of code
grep
wc
find
gmc
cfitsio
fimhisto
fimgbin

Table 4: Lines of code modified




[Figure 6 screenshot: file properties dialog with Statistics, Options, Permissions, and SLEDs tabs; the SLEDs tab shows the estimated total delivery time and, for each SLED, its bandwidth, latency, offset, and length.]

Figure 6: gmc file properties panel with SLEDs


The applications wc and grep implemented with 
SLEDs have a switch on the command that allows the 
user to choose whether or not to use SLEDs. If the 
SLEDs switch is specified, instead of accessing the data 
file from the beginning to the end, the application will 
call sleds_pick_init, sleds_pick_next_read 
and sleds_pick_finish, and use them as described 
in section 4.2. 


For wc, since the order of data access is not significant,
little overhead is generated in modifying the code. For 
applications where the order of data access is influential 
in code design, such as grep, more code changes are 
needed and as a result may have heavier execution over- 
head. In our implementation, most of the design with 
SLEDs is adopted from the one without SLEDs. How- 
ever, unless the user chooses not to output the matches, 
the result will have to be output to stdout in the order 
that they appear in the file. To deal with this, we have 
to store a match in a linked list when traversing the data 
file in the order recommended by SLEDs. We sort the 
matches in the end by their offset in the file and then 
dump them to stdout. As a result, switches such as 
-A, -B, -b, and -n had to be reimplemented. 
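

The buffer-then-sort step can be sketched as below. Note that this sketch swaps the linked list described above for an array sorted with the standard qsort routine, purely to keep the example short; struct match and flush_matches are illustrative names, not the modified grep's own.

```c
#include <stdio.h>
#include <stdlib.h>

/* One buffered match: where the line starts in the file, and its text. */
struct match {
    long        offset;
    const char *line;
};

/* qsort comparator: ascending file offset. */
static int by_offset(const void *a, const void *b)
{
    long x = ((const struct match *)a)->offset;
    long y = ((const struct match *)b)->offset;

    return (x > y) - (x < y);
}

/* Sort matches into file order and write them to out. */
static void flush_matches(struct match *m, size_t n, FILE *out)
{
    size_t i;

    qsort(m, n, sizeof m[0], by_offset);
    for (i = 0; i < n; i++)
        fprintf(out, "%s\n", m[i].line);
}
```

Matches collected in the order recommended by SLEDs are thus emitted in the order in which they appear in the file.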


Consider searching a large source tree, such as the Linux 
kernel. Programmers may do find -exec grep 
(which runs the grep text search utility for every file 
in a directory tree that matches certain criteria, such as 
ending in .c) while looking for a particular routine. If 
the routine is near the end of the set of files as normally 
scanned by find, or if the user types control-C after
seeing what he wants to see, the entry may be cached
but earlier files may already have been flushed. Repeat- 
ing the operation, then, causes a complete rescan and 
fetch from high-latency storage. The first author often 
does exactly this, and the SLEDs-aware find allows 
him to search cache first, then higher latency data only 
as needed. 


Standard find provides a switch that stops it from 
crossing mount points as it searches directory trees. This 
is useful to, for example, prevent find from running on 
NFS-mounted partitions, where it can overload a server 
and impair response time for all clients. On HSM sys- 
tems, users may wish to ignore all tape-resident data, or 
to read data from a tape currently mounted on a drive, but 
ignore those that would require mounting a new tape. In 
wide-area systems, users may wish to ignore large files 
that must come across the network. 


Figure 7: wc times over NFS, with and without SLEDs,
warm cache. The legends on the graphs are correct, but
somewhat difficult to follow; look first for the plus signs
and Xes. The dashed and solid lines in the legend refer
to the error bars, not the data lines.


[Plot: time ratio without/with SLEDs for wc over NFS; y-axis: improvement ratio; x-axis: file size (MB)]

Figure 8: wc time ratios (speedup) over NFS, with and
without SLEDs, warm cache


In our modified find, the user can choose to find
files that have a total delivery time of less than, equal
to, or greater than a certain time. find -latency
+n looks for files with more than n seconds total
retrieval time, n means exactly n seconds, and -n
means less than n seconds. mn or Mn instead of n
can be used for units of milliseconds, and un or Un
for microseconds. The SLEDs library routine
sleds_total_delivery_time was used for this
comparison. Only two extra routines (less than 100 lines
of code) were needed to add SLEDs to find, and all
existing functionality has been kept the same. These two extra
routines work and were implemented similarly to other
predicates such as -atime.
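

The argument syntax of the -latency predicate described above can be parsed along the following lines. This is a sketch only: the enum, the function names, and the choice of strtod for the numeric part are assumptions, not the routines added to find.

```c
#include <stdlib.h>

enum cmp { CMP_LESS = -1, CMP_EQUAL = 0, CMP_GREATER = 1 };

/*
 * Parse a -latency argument such as "+2", "-m500" or "u250".
 * On success stores the comparison and the threshold in seconds
 * and returns 0; returns -1 on malformed input.
 */
static int parse_latency(const char *arg, enum cmp *how, double *seconds)
{
    double scale = 1.0;
    char *end;

    *how = CMP_EQUAL;
    if (*arg == '+') {
        *how = CMP_GREATER;
        arg++;
    } else if (*arg == '-') {
        *how = CMP_LESS;
        arg++;
    }
    if (*arg == 'm' || *arg == 'M') {
        scale = 1e-3;                 /* milliseconds */
        arg++;
    } else if (*arg == 'u' || *arg == 'U') {
        scale = 1e-6;                 /* microseconds */
        arg++;
    }
    *seconds = strtod(arg, &end) * scale;
    if (end == arg || *end != '\0' || *seconds < 0.0)
        return -1;
    return 0;
}

/* Does an estimated delivery time satisfy the parsed predicate? */
static int latency_matches(double estimate, enum cmp how, double seconds)
{
    if (how == CMP_GREATER)
        return estimate > seconds;
    if (how == CMP_LESS)
        return estimate < seconds;
    return estimate == seconds;       /* exact match, as the syntax defines it */
}
```

The estimate passed to latency_matches would come from sleds_total_delivery_time; how the exact-match case rounds its operands is not specified here, so the sketch compares the values directly.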


Figure 7 presents the execution times for wc against file 
size on an NFS file system with and without SLEDs. As
expected, SLEDs start to show an advantage at file
sizes over about 50MB as the cache becomes unable to
hold the entire file. The difference in execution time re- 
mains about constant afterwards since the average usage 
of cached data by SLEDs is expected to be determined 
by the cache size, which is constant. As a result, we 
have the best percentage gain at around 60MB. We also 
noticed a very consistent performance gain, as shown by 
the small error bars in the plot. Figure 8 is the ratio of 
the two curves in Figure 7. The execution time without 
SLEDs is divided by the execution time with SLEDs, 
providing a speedup number. As we can see, this ratio
peaks at around 60MB at a value as large as 4.5, and
can be comfortably considered to be a 50% or better
improvement over a broad range.


[Plot: page faults for wc on CD-ROM, with and without SLEDs; y-axis: page faults; x-axis: file size (MB)]

Figure 9: wc page faults on CD-ROM, with and without
SLEDs, warm cache


Figure 9 plots the page faults for wc against file size on
CD-ROM with and without SLEDs. As expected, this
result shows a close correlation with the execution time.
Without SLEDs, both the execution time and page faults
increase sharply. With SLEDs, the increase in both is 
gradual. 


Figure 10 plots the execution time for grep for all 
matches against file size on CD-ROM with and without 
SLEDs. Although there is a small amount of overhead 
for small files, grep also demonstrated a very favorable 
gain of about 15 seconds on CD-ROM for large files. 
This can be interpreted as the time taken to fill the file 
cache from CD when SLEDs are not used, as the 
application derives essentially no benefit from the cached 
data in this case. 


Because this approach requires buffering matches be- 
fore output, if the percentage of matches is large, per- 
formance can be hurt by causing additional paging. All 
experiments presented here are for small match percent- 
ages (kilobytes out of megabytes) output in the correct 
order. 


The increase in execution time for small files is all CPU 
time. This is due to the additional complexity of record 
management with SLEDs, and to more data copying. 
We used read() rather than mmap(); unlike read(), 
mmap() does not copy the data to meet application 
alignment criteria. An mmap-friendly SLEDs library is 
feasible, which should reduce the CPU penalty. The 
increase appears large in percentage terms, but is a small 
absolute value. Regardless, one of the premises of SLEDs 
is that modest CPU increases are an acceptable price to 
pay for reduced I/O loads. 


Figure 10: Execution time of grep for all matches, 
CD-ROM, warm cache 


Figure 11 plots the execution time for grep for the first 
match against file size on an ext2 file system (local hard 
disk) with and without SLEDs, for a single match that 
was placed randomly in the test file. The first-match 
termination, if it finds a hit on cached data, can run 
without executing any physical I/O at all. Because the 
application reads all cached data first when using SLEDs, 
it has a higher probability of terminating before doing 
any physical I/O. The non-SLEDs run is often forced to 
do a lot of I/O because it reads from the beginning of the 
file rather than reading cached data first, regardless of 
location. Quite dramatic speedups can therefore occur 
when using SLEDs, relative to a non-SLEDs run. This 
is the ideal benchmark for SLEDs in terms of individual 
application performance. 
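

The cached-data-first read order described above can be sketched as follows. This is only an illustration under stated assumptions: the `Sled` record, its `latency` field, and the `data_for` callback are hypothetical names introduced here, not the SLEDs library interface, and matches that span chunk boundaries are ignored. 

```python
from typing import Callable, List, NamedTuple, Optional


class Sled(NamedTuple):
    """Hypothetical descriptor for one contiguous section of a file."""
    offset: int      # byte offset of the section within the file
    length: int      # section length in bytes
    latency: float   # estimated seconds until the first byte is available


def read_order(sleds: List[Sled]) -> List[Sled]:
    """Visit low-latency (cached) sections first; break ties by file offset."""
    return sorted(sleds, key=lambda s: (s.latency, s.offset))


def first_match(data_for: Callable[[Sled], bytes],
                sleds: List[Sled],
                needle: bytes) -> Optional[int]:
    """Return the file offset of the first match found in latency order.

    Simplification: a match straddling two sections is not detected.
    """
    for sled in read_order(sleds):
        pos = data_for(sled).find(needle)
        if pos >= 0:
            return sled.offset + pos
    return None
```

If the match lies in a cached section, the search terminates before any high-latency section is touched, which is the effect the first-match benchmark measures. 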


The execution time ratio for grep with the first match 
against file size on the ext2 file system, with and without 
SLEDs, is shown in Figure 12. In addition, we have 
computed the cumulative distribution function (CDF) 
for grep for the first match on an NFS file system with 
and without SLEDs, as shown in Figure 13. 


The large error bars in Figure 11 for the case without 
SLEDs are indicative of high variability in the execution 
time caused by poor cache performance. The cumulative 
distribution function for execution times shown in 
Figure 13 suggests that grep without SLEDs gained 
essentially no benefit from the fact that a majority of the test 
file is cached. 


Figure 11: Execution time of grep for one match, ext2 
FS, warm cache 


Figure 12: Ratio of mean execution time (speedup) of 
grep for one match, ext2 FS, warm cache, with and 
without SLEDs 


Figure 13: Cumulative distribution of execution time of 
grep for one match, NFS, warm cache, for a 64MB file 


4th Symposium on Operating Systems Design and Implementation 


[Plot: execution time (seconds) versus file size (MB)] 


Figure 14: Elapsed time for fimhisto, ext2 FS, warm 
cache 


5.3. LHEASOFT 


fimhisto copies an input data image file to an output 
file and appends an additional data column containing 
a histogram of the pixel values. It is implemented in 
three passes. The first pass copies the main data unit 
without any processing. The second pass reads the data 
again (including performing a data format conversion, if 
necessary) to prepare for binning the data into the his- 
togram. The third pass performs the actual binning op- 
eration, then appends the histogram to the output file. 
This three-pass algorithm resulted in observed cache be- 
havior like that shown in Figure 3. 


We adapted fimhisto to use SLEDs in the second and 
third passes over the data, reordering the pixels read and 
processed to take advantage of cached data. We 
implemented an additional library for LHEASOFT that allows 
applications to access SLEDs in units of data elements 
(usually floating point numbers), rather than bytes; the 
calls are the same, with ff_ prepended. Tests were 
performed only on an ext2 file system, and only for file sizes 
up to 64 MB. 


fimhisto showed somewhat lower gains than wc and 
grep, due to the complexity of the application, but still 
provided a 15-25% reduction in elapsed time and 30- 
50% reduction in page faults on files of 48 to 64 MB. 
Figure 14 shows the familiar pattern of SLEDs offering 
a benefit above roughly the file system buffer cache size. 
fimhisto’s I/O workload is one fourth writes, which 
SLEDs does not benefit, and includes data format con- 
version as well. These factors contribute to the differ- 
ence in performance gains compared to the above appli- 
cations. 


Figure 15: Elapsed time for fimgbin, ext2 FS, warm 
cache, 4x data reduction 


We modified fimgbin to reorder the reads on its 
input file according to SLEDs. fimgbin rebins an 
image with a rectangular boxcar filter. The amount of data 
written is smaller than the input by a fixed factor, 
typically four or 16. It can also be considered representative 
of utilities that combine multiple input images to create 
a single output image. The main fimgbin code is in 
Fortran, so we added Fortran bindings for the principal 
SLEDs library functions. 


Figure 15 shows elapsed time for fimgbin. It shows 
an eleven percent reduction in elapsed time with SLEDs 
for a data reduction factor of four on file sizes of 48MB 
or more. This is smaller than the benefit to fimhisto, 
despite similar reductions of 30-50% in page faults. We 
believe this is due to differences in the write path of the 
array-based code, which is substantially more complex 
and does more internal buffering, partially defeating our 
attempts to fully order I/Os. For a data reduction 
factor of 16, the elapsed time gains were 25-35% over the 
same range, indicating that the write traffic is an 
important factor.¹ 


6 Future Work 


The biggest areas of future work are increasing the range 
of applications that use SLEDs, improving the accuracy 
of SLEDs both in mechanical details and dynamic sys- 
tem load, and the communication of SLEDs among com- 
ponents of the system (including between file servers 
and clients). The limitations discussed in section 3.4 
need to be addressed. 


¹ A bug in the SLEDs implementation currently limits the rebinning 
parameters which operate correctly. 


Devices can be characterized either externally or inter- 
nally. Hillyer and Sandsta did external device charac- 
terization on tape drives, and we have done so on zoned 
disk drives [Van97]. The devices or subsystems could be 
engineered to report their own performance characteris- 
tics. Cooperation of subsystem vendors will be required 
to report SLEDs to the file system. Without this data, 
building true QoS-capable storage systems out of com- 
plex components such as HP’s AutoRAID [WGSS95] 
will be difficult, whether done with or without SLEDs. 


Work has begun on a migrating hierarchical storage 
management system for Linux [Sch00]. This will 
provide an excellent platform for continued development of 
SLEDs. 


7 Conclusion 


This paper has shown that Storage Latency Estimation 
Descriptors, or SLEDs, provide significant improve- 
ments by allowing applications to take advantage of 
the dynamic state of the multiple levels of file system 
caching. Applications may report expected file retrieval 
time, prune I/Os to avoid expensive operations, or re- 
order I/Os to utilize cached data effectively. 


Reporting latency, as we have done with gmc, is useful 
for interactive applications to provide the user with more 
insight into the behavior of the system. This can be ex- 
pected to improve user satisfaction with system perfor- 
mance, as well as reduce the amount of time users ac- 
tually spend idle. When users are told how long it will 
take to retrieve needed data, they can decide whether or 
not to wait, or productively multitask while waiting. 


Pruning I/Os is especially important in heavily loaded 
systems, and for applications such as find that can 
impose heavy loads. This is useful for network file systems 
and hierarchical storage management systems, where re- 
trieval times may be high and impact on other users is a 
significant concern. Because the SLEDs interface is in- 
dependent of the file system and physical device struc- 
ture, users do not need to be aware of mount points, vol- 
ume managers, or HSM organization. Scripts and other 
utilities built around this concept will remain useful even 
as storage systems continue to evolve. 


Reordering I/Os has been shown, through a series of 
experiments on wc and grep, to improve execution time 
by 50 percent to more than an order of magnitude for 
file sizes of one to three times 
the size of the file system buffer cache. Experiments 
showed an 11-25 percent reduction in elapsed time for 
fimhisto and fimgbin, members of a large suite 
of professional astronomical image processing software. 
SLEDs-enabled applications have more stable perfor- 
mance in this area as well, showing a gradual decline in 
performance compared to the sudden step phenomenon 
at just above the file system buffer cache size exhibited 
without SLEDs. These experiments were run on ext2, 
NFS and CD-ROM file systems; the effects are expected 
to be much more pronounced on hierarchical storage 
management systems. 


Abstract 


In traditional file system implementations, the Least 
Recently Used (LRU) block replacement scheme is 
widely used to manage the buffer cache due to its sim- 
plicity and adaptability. However, the LRU scheme 
exhibits performance degradations because it does not 
make use of reference regularities such as sequential 
and looping references. In this paper, we present a 
Unified Buffer Management (UBM) scheme that exploits 
these regularities and yet is simple to deploy. 
The UBM scheme automatically detects sequential and 
looping references and stores the detected blocks in sep- 
arate partitions of the buffer cache. These partitions are 
managed by appropriate replacement schemes based on 
their detected patterns. The allocation problem among 
the divided partitions is also tackled with the use of the 
notion of marginal gains. In both trace-driven simu- 
lation experiments and experimental studies using an 
actual implementation in the FreeBSD operating sys- 
tem, the performance gains obtained through the use of 
this scheme are substantial. The results show that the hit 
ratios improve by as much as 57.7% (with an average of 
29.2%) and the elapsed times are reduced by as much 
as 67.2% (with an average of 28.7%) compared to the 
LRU scheme for the workloads we used. 


1 Introduction 


Efficient management of the buffer cache by using 
an effective block replacement scheme is important for 
improving file system performance when the size of the 
buffer cache is limited. To this end, various block 
replacement schemes have been studied [1, 2, 3, 4, 5, 6]. 
Yet, the Least Recently Used (LRU) block replacement 
scheme is still widely used due to its simplicity. While 
simple, it adapts very well to the changes of the work- 
load, and has been shown to be effective when recently 
referenced blocks are likely to be re-referenced in the 
near future [7]. A main drawback of the LRU scheme, 
however, is that it cannot exploit regularities in block 
accesses such as sequential and looping references and 
thus, yields degraded performance [3, 8, 9]. In this 
paper, we present a Unified Buffer Management (UBM) 
scheme that exploits these regularities and yet, is sim- 
ple to deploy. The performance gains are shown to be 
substantial. Trace-driven simulation experiments show 
that the hit ratios improve by as much as 57.7% (with 
an average of 29.2%) compared to the LRU scheme for 
the traces we considered. Experimental studies using an 
actual implementation of this scheme in the FreeBSD 
operating system show that the elapsed time is reduced 
by as much as 67.2% (with an average of 28.7%) com- 
pared to the LRU scheme for the applications we used. 


1.1 Motivation 


The graphs in Figure 1 show the motivation behind this 
study. First, Figure 1(a) shows the space-time graph 
of block references from three applications, namely, 
cscope, cpp, and postgres (details of which will be 
discussed in Section 4), executing concurrently. The 
x-axis is the virtual time, which ticks at each block 
reference, and the y-axis is the logical block number of the 
block referenced at the given time. From this graph, 
we can easily notice sequential and looping reference 
regularities throughout their execution. 


Figure 1: Caching behaviors of the OPT block replacement scheme and the LRU block replacement scheme. 


Now consider the graphs in Figures 1(b) and 1(c). 
They show the Inter-Reference Gap (IRG) distributions 
of blocks that hit in the buffer cache for the off-line 
optimal (OPT) block replacement scheme and the LRU 
block replacement scheme, respectively, as the cache 
size increases from 100 blocks to 1000 blocks. The x- 
axis is the IRG and the y-axis is the total hit count with 
the given IRG. 


Observe from the corresponding graphs of the two 
figures how differently the two replacement 
schemes behave. The main difference comes from how 
looping references (that is, blocks that are accessed 
repeatedly with a regular reference interval, which we 
refer to as the loop period) are treated. The OPT scheme 
retains the blocks in increasing order of loop 
periods as the cache size increases since the scheme chooses 
a victim block according to the forward distance (i.e., 
difference between the time of the next reference in the 
future and the current time). From Figure 1(b), we can 
see that in the OPT scheme the hit counts of blocks 
with IRG between 70 and 90 increase gradually as the 
cache size increases. On the other hand, in the LRU 
scheme there are no buffer hits at this range of IRGs 
even when the buffer cache has 1000 blocks. This re- 
sults from blocks at these IRGs being replaced either by 
blocks that are sequentially referenced (and thus never 
re-referenced) or by those with larger IRGs (and thus 
are replaced before being re-referenced). Although pre- 
dictable, the regularities of sequential and looping ref- 
erences are not exploited by the LRU scheme, which 
leads to significantly degraded performance. 


From this observation, we devise a new buffer man- 
agement scheme called the Unified Buffer Management 
(UBM) scheme. The UBM scheme exploits 
regularities in reference patterns such as sequential and 
looping references. Evaluation of the UBM scheme using 
both trace-driven simulations and an actual 
implementation in the FreeBSD operating system shows that 1) the 
UBM scheme is very effective in detecting sequential 
and looping references, 2) the UBM scheme manages 
sequentially-referenced and looping-referenced blocks 
similarly to the OPT scheme, and 3) the UBM scheme 
shows substantial performance improvements. 


1.2 The Remainder of the Paper 


The remainder of this paper is organized as follows. 
In the next section, we review related work. In Section 
3, we explain the UBM scheme in detail. In Section 4, 
we describe our experimental environments and com- 
pare the performance of the UBM scheme with those 
of previous schemes through trace-driven simulations. 
In Section 5, an implementation of the UBM scheme in 
the FreeBSD operating system is evaluated. Finally, we 
provide conclusions and directions for future research 
in Section 6. 


2 Related Work 


In this section, we place previous page/block replace- 
ment schemes into the following three groups and in 
turn, survey the schemes in each group. 


• Replacement schemes based on frequency and/or 
recency factors. 


• Replacement schemes based on user-level hints. 


• Replacement schemes making use of regularities 
of references such as sequential references and 
looping references. 


The FBR (Frequency-based Replacement) scheme by 
Robinson and Devarakonda [1], the LRU-K scheme by 
O'Neil et al. [2], the IRG (Inter-Reference Gap) scheme 
by Phalke and Gopinath [5], and the LRFU (Least 
Recently/Frequently Used) scheme by Lee et al. [6] fall 
into the first group. The FBR scheme chooses a 
victim block to be replaced based on the frequency factor, 
differing from the Least Frequently Used (LFU) scheme mainly 
in that it considers correlations among references. The 
LRU-K scheme bases its replacement decision on the 
blocks’ kth-to-last reference, while the IRG scheme’s 
decision is based on the inter-reference gap factor. The 
LRFU scheme considers both the recency and frequency 
factors of blocks. These schemes, however, show 
limited performance improvements because they do not 
consider regular references such as sequential and 
looping references. 


Application-controlled file caching by Cao et al. [4] 
and informed prefetching and caching by Patterson et 
al. [10] are schemes based on user-level hints. These 
schemes choose a victim block to be replaced based 
on user-provided hints on application reference 
characteristics, allowing different replacement policies to be 
applied to different applications. However, to obtain 
user-level hints, users need to accurately understand 
the characteristics of block reference patterns of 
applications. This requires considerable effort from users, 
limiting the applicability of these schemes. 


The third group of schemes considers regularities of 
references, and the 2Q scheme by Johnson and Shasha 
[3], the SEQ scheme by Glass and Cao [8], and the 
EELRU (Early Eviction LRU) scheme by Smaragdakis 
et al. [9] fall into this group. The 2Q scheme quickly 
removes from the buffer cache sequentially-referenced 
blocks and looping-referenced blocks with long loop 
periods. This is done by using a special buffer called 
the A1in queue in which all missed blocks are initially 
placed and from which the blocks are replaced in 
FIFO order after a short residence. On the other hand, 
the scheme holds looping-referenced blocks with short 
loop periods in the main buffer cache by using a ghost 
buffer called the A1out queue in which the addresses of 
blocks replaced from the A1in queue are temporarily 
placed to discriminate between frequently referenced 
blocks and infrequently referenced ones. When a block 
is re-referenced while its address is in the A1out queue, 
it is promoted to the main buffer cache. The 2Q scheme, 
however, has two drawbacks. One is that an additional 
miss has to occur for a block to be promoted to the main 
buffer cache from the A1out queue. The other is that 
careful tuning is required for two control parameters, 
that is, the size of the A1in queue and the size of the 
A1out queue, which may be sensitive to the types of 
workload. 


The SEQ scheme detects long sequences of page faults 
and applies the Most Recently Used (MRU) scheme 
to those pages. However, in determining the victim 
page, it does not distinguish sequential and looping ref- 
erences. The EELRU scheme confirms the existence 
of looping references by examining aggregate recency 
distributions of referenced pages and changes the page 
eviction points using a simple on-line cost/benefit analy- 
sis. The EELRU scheme, however, does not distinguish 
between looping references with different loop periods. 


3 The Unified Buffer Management Scheme 


The Unified Buffer Management (UBM) scheme is 
composed of the following three main modules. 


Detection This module automatically detects 
sequential and looping references. After the detection, 
block references are classified into sequential, 
looping, or other references. 


Replacement This module applies different replace- 
ment schemes to the blocks belonging to the three 
reference patterns according to the properties of 
each pattern. 


Allocation This module allocates the limited buffer 
cache space among the three partitions corre- 
sponding to sequential, looping, and other refer- 
ences. 


In the following subsections, we give a detailed expla- 
nation of each of these modules. 


3.1 Detection of Sequential and Looping References 


The UBM scheme automatically detects sequential, 
looping, and other references according to the following 
rules: 


Sequential references that are consecutive block ref- 
erences occurring only once. 


Looping references that are sequential references oc- 
curring repeatedly with a regular interval. 


Other references that are detected neither as sequen- 
tial nor as looping references. 


Figure 2 shows the classification process of the UBM 
scheme. Note that looping references are initially de- 
tected as sequential until they are re-referenced. 


For on-line detection of sequential and looping refer- 
ences, information about references to blocks in each 
file is maintained in an abstract form. The elements 
needed are shown in Figure 3. 


Information for each file is maintained as a 4-tuple 
consisting of a file descriptor (fileID), a start block 
number (start), an end block number (end), and a loop 
period (period). A reference is categorized as a 
sequential reference after a given number of consecutive 
references are made. For a sequential reference the loop 
period is ∞, while for a looping reference its value is 
the actual loop period. In real systems, the loop 
period fluctuates due to various factors including the degree 
of multiprogramming and scheduling. Hence, this value 
is set as an exponential average of measured loop 
periods. Also, since a looping reference is initially detected 
as a sequential reference, its blocks are managed just 
like those belonging to a sequential reference until they 
are re-referenced. This may make them miss the first 
time they are re-referenced if there is not enough space 
in the cache (cf. Section 4.7). 


Figure 2: Classification process of the UBM scheme. 


The resulting table keeps information for sequences 
of consecutive block references that are detected up 
to the current time and is updated whenever a block 
reference occurs. In most UNIX file systems, sequences 
of consecutive block references are detected by using 
vnode numbers and consecutive block addresses. 


Figure 4 shows an example of sequential and looping 
references, and the data structure that is used to maintain 
information for these references. In the figure, the file 
with fileID 3 is a sequential reference as it has ∞ as 
its loop period. Files with fileID 1 and 2 are looping 
references with loop periods of 80 and 40, respectively. 
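

The per-file bookkeeping described above might be sketched as follows. This is a minimal illustration, not the paper's detection algorithm: the field `last_start_time`, the threshold `min_run`, the smoothing weight `alpha`, and the rule that a new reference to `start` marks a new loop iteration are all assumptions made for the example. 

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class SeqEntry:
    """The (fileID, start, end, period) 4-tuple plus one helper field."""
    file_id: int
    start: int
    end: int
    period: float = math.inf      # infinity marks a sequential reference
    last_start_time: int = 0      # assumed: virtual time `start` was last seen


def classify(entry: SeqEntry, min_run: int = 3) -> str:
    """Label an entry; runs shorter than min_run stay 'other' (assumed rule)."""
    if entry.end - entry.start + 1 < min_run:
        return "other"
    return "sequential" if math.isinf(entry.period) else "looping"


def on_reference(table: Dict[int, SeqEntry], file_id: int, block: int,
                 now: int, alpha: float = 0.5, min_run: int = 3) -> str:
    """Update the table for one block reference and return its class."""
    entry = table.get(file_id)
    if entry is None:
        table[file_id] = SeqEntry(file_id, block, block, math.inf, now)
    elif block == entry.end + 1:
        entry.end = block                       # the sequence grows by one
    elif block == entry.start:
        measured = now - entry.last_start_time  # one full loop has elapsed
        if math.isinf(entry.period):
            entry.period = measured
        else:                                   # exponential average
            entry.period = alpha * measured + (1 - alpha) * entry.period
        entry.last_start_time = now
    elif not (entry.start < block <= entry.end):
        # Outside the known range: start tracking a new sequence (assumed).
        table[file_id] = SeqEntry(file_id, block, block, math.inf, now)
    return classify(table[file_id], min_run)
```

With these assumptions, a file read once from block 0 to block 10 is labeled sequential, and it becomes looping, with a measured period, only when block 0 is referenced again. 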




3.2 Block Replacement Schemes 


The detection mechanism results in files being 
categorized into three types, that is, sequential, looping, and 
other references. The buffer cache is divided into three 
partitions to accommodate the three different types of 
references. Management of the partitions must be done 
according to the reference characteristics of blocks 
belonging to each partition. 


[Plot: block number versus time, annotated with (fileID, start, end, period) tuples] 


Figure 3: Elements for representing sequential and 
looping references. 


Figure 5: Typical curves of buffer hits per unit time and 
marginal gain values. 


For the partition that holds sequential references, replacement is a 
simple matter. Sequentially-referenced blocks are never 
re-referenced. Hence, the referenced blocks need not be 
retained and therefore, the MRU replacement policy is 
used. 


For the partition that holds looping references, victim 
blocks to be replaced are chosen based on their loop 
periods because their re-reference probabilities depend 
on these periods. To do so, we use a period-based 
replacement scheme that replaces blocks in decreasing 
order of loop periods, and the MRU block replacement 
scheme among blocks with the same loop period. 
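

As an illustration of the period-based policy just described, the victim can be picked with a single comparison key. The tuple layout `(block_id, loop_period, last_ref_time)` is a hypothetical representation chosen for this sketch. 

```python
from typing import List, Tuple

# Hypothetical record: (block_id, loop_period, last_ref_time)
LoopBlock = Tuple[int, float, int]


def choose_loop_victim(blocks: List[LoopBlock]) -> int:
    """Return the id of the block to evict from the looping partition.

    The block with the longest loop period goes first; among blocks with
    the same period, the most recently used one (MRU) is evicted.
    """
    if not blocks:
        raise ValueError("looping partition is empty")
    victim = max(blocks, key=lambda b: (b[1], b[2]))
    return victim[0]
```

Evicting the most recently used block among equal periods keeps the blocks that will be re-referenced soonest, since the block touched last in a loop is the one whose next reference is furthest away. 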


Finally, within the partition that holds other references, 
victim blocks to be replaced can be chosen based on 
their recency, frequency, or a combination of the two 
factors. Hence, we may use any of the previously proposed 
replacement schemes including the Least Recently Used 
(LRU), the Least Frequently Used (LFU), the LRU-K, 
and the Least Recently/Frequently Used (LRFU) as long 
as they have a model that approximates the hit ratio for 
a given buffer size to compute the marginal gain, which 
will be explained in the next subsection. In this paper, 
we assume the LRU replacement scheme. 


3.3 Buffer Allocation Based on Marginal Gain 


The buffer cache has now been divided into three 
partitions that are being managed separately. An important 
problem that should be addressed then is how to allocate 
the blocks in the cache among the three partitions. To 
this end, we use the notion of marginal gains, which has 
frequently been used in resource allocation strategies in 
various computer systems areas [10, 11, 12, 13]. 


Marginal gain is defined as $MG(n) = Hit(n) - Hit(n-1)$, 
which specifies the expected number of 
extra buffer hits per unit time that would be obtained by 
increasing the number of allocated buffers from $(n-1)$ 
to $n$, where $Hit(n)$ is the expected number of buffer 
hits per unit time using $n$ buffers. In the following, we 
explain how to predict the marginal gains of 
sequential, looping, and other references, as the buffer cache 
is partitioned to accommodate each of these types of 
references. 


The expected number of buffer hits per unit time of 
sequential references when using $n$ buffers is $Hit_{seq}(n) = 0$, 
and thus, the expected marginal gain is always 
$MG_{seq}(n) = 0$. 


For looping references, the expected number of buffer 
hits per unit time and the expected marginal gain value 
are calculated as follows. 


First, for a looping reference, $loop_i$, with loop length 
$l_i$ and loop period $p_i$, the expected number of buffer hits 
per unit time when using $n$ buffers is $Hit_{loop_i}(n) = 
\min[l_i, n] / p_i$. Thus, if $n \le l_i$, the expected marginal 
gain is $MG_{loop_i}(n) = n/p_i - (n-1)/p_i = 1/p_i$, and 
if $n > l_i$, $MG_{loop_i}(n) = l_i/p_i - l_i/p_i = 0$. 


Now, assume that there are $L$ looping references 
$\{loop_1, \ldots, loop_i, \ldots, loop_L\}$, where the loops are 
arranged in increasing order of loop periods. Let 
$i_{max}$ be the maximum of $i$ such that $\sum_{k=1}^{i} l_k < n$, 
where $n$ is the number of buffers in the partition for 
looping references. If $i_{max} = L$, then all loops can be held in 
the buffer cache, and hence $Hit_{loop}(n) = \sum_{k=1}^{L} l_k / p_k$ 
and $MG_{loop}(n) = 0$. Consider now the more 
complicated case where $i_{max} < L$. The expected number 
of buffer hits per unit time of these looping references 
when using $n$ buffers is $Hit_{loop}(n) = \sum_{k=1}^{i_{max}} l_k / p_k + 
\min[l_{i_{max}+1}, n - \sum_{k=1}^{i_{max}} l_k] / p_{i_{max}+1}$. (Recall that we 
are using the period-based replacement scheme to 
manage the partition for looping references. Hence, there 
can be no loops within this partition that have a loop period 
longer than $p_{i_{max}+1}$.) Hence, the expected marginal 
gain is $MG_{loop}(n) = 1/p_{i_{max}+1}$. 
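

The looping-partition formulas can be checked numerically with a short sketch. It assumes each loop is given as a `(length, period)` pair and fills buffers in increasing order of loop period, as the period-based replacement scheme implies; the function names are illustrative. 

```python
from typing import List, Tuple

Loop = Tuple[int, float]  # (loop length l_i in blocks, loop period p_i)


def hit_loop(loops: List[Loop], n: int) -> float:
    """Expected buffer hits per unit time with n buffers for looping blocks."""
    if n < 0:
        raise ValueError("n must be non-negative")
    hits = 0.0
    remaining = n
    # Shortest periods are kept first under period-based replacement.
    for length, period in sorted(loops, key=lambda lp: lp[1]):
        used = min(length, remaining)
        hits += used / period
        remaining -= used
    return hits


def mg_loop(loops: List[Loop], n: int) -> float:
    """Marginal gain of the n-th buffer: Hit(n) - Hit(n - 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return hit_loop(loops, n) - hit_loop(loops, n - 1)
```

For two loops of ten blocks each with periods 40 and 80, the 15th buffer lands in the second loop, so its marginal gain is 1/80, and any buffer beyond the 20th gains nothing. 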


Finally, for the partition that holds the other references 
and which is managed by the LRU replacement scheme, 
the expected number of buffer hits per unit time and the 
expected marginal gain value can be calculated from the 
buffer hit ratio using ghost buffers [11, 10] and/or 
Belady’s lifetime function [14]. Ghost buffers, 
sometimes called dataless buffers, are used to estimate the 
number of buffer hits per unit time for cache sizes larger 
than the current size when the cache is managed by 
the LRU replacement scheme. A ghost buffer does not 
contain any data blocks but maintains control 
information needed to count cache hits. The prediction of the 
buffer hit ratio using only ghost buffers is impractical 
due to the overhead of measuring the hit counts of all 
LRU stack positions individually. In the UBM scheme, 
we use an approximation method suggested by Choi et 
al. [13]. The proposed method utilizes Belady’s 
lifetime function, which is well known to approximate 
the buffer hit ratio of references that follow the LRU 
model. Specifically, the hit ratio with $n$ buffers is given 
by Belady’s lifetime function as 

$hit_{other}(n) = h_1 + h_2 + h_3 + \ldots + h_n \approx 1 - c \times n^{-k}$ 


where $c$ and $k$ are control parameters. Specific $c$ and 
$k$ values can be calculated on-line by using measured 
buffer hit ratios at pre-set cache sizes. Ghost buffers are 
used to determine the hit ratios at these pre-set cache 
sizes. The overhead of using ghost buffers in this case is 
minimal, as accurate LRU stack positions of referenced 
blocks need not be located. For example, to calculate the 
values of the control parameters, $c$ and $k$, buffer hit ratios 
at a minimum of two cache sizes, say, $p$ and $q$, where 
$p \ne q$, are required. Using these values and the equation of 
the lifetime function, we can calculate the values of $c$ and 
$k$. Then, the expected number of buffer hits per unit time 
is given by $Hit_{other}(n) = hit_{other}(n) \times n_{other}/n_{total}$, where 
$n_{other}$ and $n_{total}$ are the number of other references and 
the number of total references, respectively, during the 
observed period. Finally, the expected marginal gain is 
simply $MG_{other}(n) = Hit_{other}(n) - Hit_{other}(n-1)$. 
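

One way to obtain the two control parameters from two measurements is sketched below. It assumes the hit ratio follows the form $1 - c \times n^{-k}$ exactly at both cache sizes, so that taking the ratio of the two miss ratios eliminates $c$; the function names are illustrative and the closed form is a derivation from that assumption, not a procedure quoted from the text. 

```python
import math
from typing import Tuple


def fit_lifetime(p: int, hit_p: float, q: int, hit_q: float) -> Tuple[float, float]:
    """Solve 1 - hit = c * n**(-k) for (c, k) from two (size, hit ratio) points.

    From (1 - hit_p) / (1 - hit_q) = (q / p)**k it follows that
    k = ln((1 - hit_p) / (1 - hit_q)) / ln(q / p) and c = (1 - hit_p) * p**k.
    """
    if p <= 0 or q <= 0 or p == q:
        raise ValueError("cache sizes must be positive and distinct")
    if not (0.0 < hit_p < 1.0 and 0.0 < hit_q < 1.0):
        raise ValueError("hit ratios must lie strictly between 0 and 1")
    k = math.log((1.0 - hit_p) / (1.0 - hit_q)) / math.log(q / p)
    c = (1.0 - hit_p) * p ** k
    return c, k


def hit_other(n: int, c: float, k: float) -> float:
    """Approximate LRU hit ratio with n buffers under the fitted parameters."""
    return 1.0 - c * n ** (-k)
```

For example, hit ratios of 0.91 at 100 buffers and 0.955 at 400 buffers yield $c = 0.9$ and $k = 0.5$, after which `hit_other` can be evaluated at any intermediate size. 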


Figure 5 shows typical curves of both buffer hits 
per unit time and marginal gain values of sequential, 
looping, and other references as the number of 
allocated buffers increases. In the UBM scheme, since the 
marginal gain of sequential references, $MG_{seq}(n)$, is 
always zero, the buffer manager does not allocate more 
than one buffer to the corresponding partition, except 
when buffers are not fully utilized. That is, more than 
one buffer may be allocated to this partition only when 
there are free buffers at the initial stage of allocation. 
Thus, in general, buffer allocation is determined 
between the partitions that hold the looping-referenced 
blocks and the other-referenced blocks. The UBM 
scheme tries to maximize the expected number of 
total buffer hits by dynamically controlling the allocation 
so that the marginal gain value of looping references, 
$MG_{loop}(n)$, and the marginal gain value of other 
references, $MG_{other}(C - n)$, where $C$ is the cache size, 
converge to the same value. 
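

The allocation goal can be illustrated off-line with a brute-force search over every split of the cache between the two partitions. This is deliberately not the dynamic adjustment the scheme performs; it only shows which split maximizes the total expected hits under assumed inputs. The parameter `share` stands for the fraction of references classified as other, and all names are illustrative. 

```python
from typing import List, Tuple

Loop = Tuple[int, float]  # (loop length in blocks, loop period)


def hit_loop(loops: List[Loop], n: int) -> float:
    """Expected hits per unit time for looping blocks with n buffers."""
    hits, remaining = 0.0, n
    for length, period in sorted(loops, key=lambda lp: lp[1]):
        used = min(length, remaining)
        hits += used / period
        remaining -= used
    return hits


def hit_other(n: int, c: float, k: float, share: float) -> float:
    """Expected hits per unit time for other references with n buffers."""
    if n <= 0:
        return 0.0
    return max(0.0, 1.0 - c * n ** (-k)) * share


def best_split(loops: List[Loop], cache_size: int,
               c: float, k: float, share: float) -> int:
    """Return the number of buffers for the looping partition that
    maximizes total expected hits; the rest go to the other partition."""
    def total(n: int) -> float:
        return hit_loop(loops, n) + hit_other(cache_size - n, c, k, share)
    return max(range(cache_size + 1), key=total)
```

When the search lands in the interior of the range, the marginal gains of the two partitions at that point are approximately equal, which is the condition the allocation module steers toward. 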


3.4 Interaction Among the Three Modules 


Figure 6 shows the overall interactions between the 
three modules of the UBM scheme. Whenever a block 
reference occurs in the buffer cache, the detection mod- 
ule updates and/or classifies the reference into one of 
the sequential, looping, or other reference types (step 
(1) in Figure 6). In this example, assume that a miss
has occurred and the reference is classified as an other
reference. Because a miss has occurred, the buffer allocation
module is called to get additional buffer space for the
referenced block (step (2)). The buffer allocation module
would normally compare the marginal gain values
of looping and other references, choose the one with
the smaller marginal gain value, and send a replacement
request to the corresponding cache partition, as shown
in step (3). However, if there is space allocated to a
sequential reference, such space is always deallocated
first. The cache management module of the selected
cache partition selects a victim block to be replaced
using its replacement scheme (step (4)) and releases
the buffer space of the victim block to the allocation
module (step (5)). The allocation module forwards this
space to the cache partition for other-referenced blocks
(step (6)). Finally, the referenced block is fetched from
disk into the buffer space (step (7)).
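
The miss path described in steps (2) through (7) can be sketched as follows. This is a minimal illustration under stated assumptions, not the FreeBSD implementation: partitions are plain Python lists, the cache is assumed to be full, the marginal gain values are passed in as precomputed numbers, and every partition evicts its first element as a placeholder for its real replacement scheme. The names choose_donor and handle_miss are hypothetical.

```python
def choose_donor(partitions, gains):
    """Steps (2)-(3): decide which partition gives up a buffer.

    Space held by sequential references is always reclaimed first;
    otherwise the partition with the smaller marginal gain is chosen.
    """
    if partitions["sequential"]:
        return "sequential"
    candidates = [name for name in ("looping", "other") if partitions[name]]
    return min(candidates, key=lambda name: gains[name])


def handle_miss(block, ref_type, partitions, gains):
    """Steps (4)-(7): evict a victim and place the missed block."""
    donor = choose_donor(partitions, gains)
    victim = partitions[donor].pop(0)   # placeholder replacement policy
    partitions[ref_type].append(block)  # buffer forwarded, block fetched
    return donor, victim


if __name__ == "__main__":
    parts = {"sequential": ["s1"], "looping": ["l1", "l2"], "other": ["o1"]}
    gains = {"looping": 0.4, "other": 0.1}
    print(handle_miss("x", "other", parts, gains))
    print(handle_miss("y", "other", parts, gains))
```

In the second call no sequential space is left, so the victim comes from the partition with the smaller marginal gain.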


4 Performance Evaluation 


In this section and the next, we discuss the performance
of the UBM scheme. This section concentrates on the 
simulation study, while the next section focuses on the 
implementation study. 


In this section, the performance of the UBM scheme
is compared with those of the LRU, 2Q, SEQ, EELRU,
and OPT schemes through trace-driven simulations¹.
We also compare the performance of the UBM scheme
with that of application-controlled file caching through
trace-driven simulations with the same multiple application
trace used in [4]. We did not compare the performance
of the UBM scheme with those of schemes based on
recency and/or frequency factors such as FBR, LRU-K,
and LRFU since the benefits from the two factors are
largely orthogonal and any of the latter schemes can be
used to manage other references in the UBM scheme.
We first describe the experimental setup and then present
the performance results.


¹ Though the SEQ and EELRU schemes were originally
proposed as page replacement schemes, they can also be used
as block replacement schemes.


Figure 6: Overall structures of the UBM scheme.


Table 1: Characteristics of the traces used.

| Trace  | Applications executed concurrently | # of references | # of unique blocks |
| Multi1 |                                    | 15858           |                    |
| Multi2 | cscope, cpp, postgres              | 26311           | 5684               |
| Multi3 | cpp, gnuplot, glimpse, postgres    | 30241           | 7453               |


4.1 Experimental Setup


Traces used in our simulations were obtained by concurrently
executing diverse applications on the FreeBSD
operating system running on an Intel Pentium PC. The
characteristics of the applications are described below.


cpp Cpp is the GNU C compiler pre-processor. The
total size of C sources used as input is roughly
11MB. During execution, observed block references
are sequential and other references.


cscope Cscope is an interactive C source examination 
tool. The total size of C sources used as input is 
roughly 9MB. It exhibits looping references with 
an identical loop period and other references. 


glimpse Glimpse is a text information retrieval utility.
The total size of text files used as input is
roughly 50MB. Its block reference characteristic
is diverse: it shows sequential references, looping
references with different loop periods, and other
references.


gnuplot Gnuplot is an interactive plotting program.
The size of raw data used as input is 8MB. Looping
references with an identical loop period and
other references were observed during execution.


Figure 7: Reference classification results (Trace: Multi2).


postgres Postgres is a relational database system from
the University of California at Berkeley. We used
join queries among four relations, namely, twothoustup,
twentythoustup, hundredthoustup, and
twohundredthoustup, which were made from a
scaled-up Wisconsin benchmark. The sizes of
the relations are approximately 150KB, 1.5MB,
7.5MB, and 15MB, respectively. It exhibits sequential
references, looping references with different loop
periods, and other references.


mpeg_play Mpeg_play is an MPEG player from the
University of California at Berkeley. The size
of the MPEG video file used as input is 5MB.
Sequential references dominate in this application.


We used three multiple application traces in our ex- 
periments. They are denoted by Multi1, Multi2, and 
Multi3, and their characteristics are shown in Table 1. 


We built simulators for the LRU, 2Q, SEQ, EELRU,
and OPT schemes as well as the UBM scheme. Unlike
the UBM scheme, which does not have any parameters that
need to be tuned, the 2Q, SEQ, and EELRU schemes
have one or more parameters whose settings may affect
the performance. For example, in the 2Q scheme the
parameters are the sizes of the A1in and A1out queues.
The parameters of the SEQ scheme are threshold values
used to choose victim blocks among consecutively
missed blocks. Finally, for the EELRU scheme, the
early and late eviction points from which a block is
replaced have to be set. In our experiments, we used the
values suggested by the authors of each of the schemes.


4.2 Detection Results 


Figure 7 shows the classifications resulting from the 
detection module for the Multi2 trace. The x-axis
is the virtual time and the y-axis is the logical block 
number. The figures are space-time graphs with Figure 
7(a) showing block references for the whole trace and 
Figure 7(b) showing how the detection module classified 
the sequential, looping, and other references from the 
original references. The results indicate that the UBM 
scheme accurately classifies the three reference patterns. 


4.3 IRG Distribution Comparison of the UBM
Scheme with the OPT Scheme


Figure 8 shows the caching behaviors of the OPT and
UBM schemes for the Multi2 trace. Compare these
results with those shown in Figures 1(b) and 1(c), which
use the same trace. Recall that these graphs show the
hit counts in the buffer cache using IRG distributions
of referenced blocks. We note that the UBM scheme
very closely mimics the behavior of the OPT scheme.


Figure 8: Caching behaviors of the OPT and UBM schemes.


4.4 Performance Comparison of the UBM 
Scheme with Other Schemes 


Figure 9 shows the buffer hit ratios of the UBM scheme 
and other replacement schemes as a function of cache 
size with the block size set to 8KB. For most cases, 
the UBM scheme shows the best performance. Further 
analysis for each of the schemes is as follows.


The SEQ scheme shows fairly stable performance for
all cache sizes. The reason behind its good performance
is that it quickly replaces sequentially-referenced blocks
that miss in the buffer cache. However, since the scheme
does not consider looping references, it shows worse
performance than the UBM scheme.


The 2Q scheme shows better performance than the
LRU scheme for most cases because it quickly replaces
sequentially-referenced blocks and looping-referenced
blocks with long loop periods. However, when the
cache size is large (caches with 1800 or more blocks for
the Multi1 trace, 2800 or more blocks for the Multi2
trace, and 3600 or more blocks for the Multi3 trace),
the scheme shows worse performance than the LRU
scheme. There are two reasons behind this. First, since
the scheme replaces all newly referenced blocks after
holding them in the buffer cache for a short time, whenever
these blocks are re-referenced, additional misses
occur. The ratio of these additional misses to total
misses increases as the cache size increases, resulting
in a significant impact on the performance of the buffer
cache. Second, the scheme does not distinguish between
looping-referenced blocks with different loop periods.
The performance of the 2Q scheme does not gradually
increase with the cache size, but rather surges beyond
some cache size and then holds steady. Also, the
2Q scheme exhibits rather anomalous behavior for the
Multi2 and Multi3 traces. When the cache sizes are
about 2200 blocks and 2800 blocks for Multi2 and
Multi3, respectively, the buffer hit ratio of the scheme
decreases as the cache size increases. A careful inspection
of the results reveals that when the cache size increases,
the A1out queue size increases accordingly, and this
results in a situation where blocks that should not be
promoted are promoted to the main buffer cache, leading
to such an anomaly.


The EELRU scheme shows similar or better perfor- 
mance than the LRU scheme as the cache size increases. 
However, since the scheme chooses a victim block to be 
replaced based on aggregate recency distributions of ref- 
erenced blocks, it does not replace quickly enough the 
sequentially-referenced blocks and looping-referenced 
blocks that have long loop periods. Further, like the 
2Q scheme, it does not distinguish between looping- 
referenced blocks with different loop periods. Hence, it 
does not fare well compared to the UBM scheme.


The LRU scheme shows the worst performance for 
most cases because it does not give any consideration to 
the regularities of sequential and looping references. 


Finally, the UBM scheme replaces sequentially- 
referenced blocks quickly and holds the looping- 
referenced blocks in increasing order of loop periods 
based on the notion of marginal gains as the cache size 
increases. Consequently, the UBM scheme improves 
the buffer hit ratios by up to 57.7% (for the Multi1 
trace with 1400 blocks) compared to the LRU scheme 
with an average increase of 29.2%. 


4.5 Results of Dynamic Buffer Allocation 


Figure 10(a) shows the distribution of the buffers allocated
to the partitions that hold sequential, looping, and
other references as a function of (virtual) time when the
buffer cache size is 1000 blocks. Until time 2000, the
buffer cache is not fully utilized. Hence, buffers are
allocated to any reference that requests them, resulting in
the partition that holds sequentially-referenced blocks
being allocated a considerable number of blocks. After
time 2000, all the buffers have been consumed, and
hence, the number of buffers allocated to the partition for
sequentially-referenced blocks decreases, while allocations
to the partitions for looping and other references
start to increase. As a result, at around time 6000, only
one buffer is allocated to the partition for sequentially-referenced
blocks. From about time 10000, the allocations
to the partitions for the looping and other references
converge to a steady-state value.


Figure 9: Performance comparison of the UBM scheme
with other schemes.


Figure 10: Results of dynamic buffer allocation (cache size: 1000 blocks, trace: Multi2).


Figure 10(b) shows marginal gains as a function of allocated
buffers of looping and other references that are
calculated at time 20000 of Figure 10(a). Since there
are several looping references with different loop periods
in the Multi2 trace, from the left figure, we can
see that the expected marginal gain values of the looping
references, $MG_{loop}(n)$, decrease step-wise as the
number of allocated buffers, $n$, increases. The figure on
the right shows the expected marginal gain values of the
other references, $MG_{other}(n)$, that decrease gradually.
The UBM scheme dynamically controls the number of
allocated buffers to each partition so that the marginal
gain values of the two partitions converge to the same
value. At time 20000, the two marginal gain values converge
to 0.000222 and the numbers of allocated buffers
to the partitions for the looping and other references are
731 blocks and 268 blocks, respectively.


4.6 Performance Comparison of the UBM 
Scheme with Application-controlled File 
Caching 


The application-controlled file caching (ACFC)
scheme [4] is a user-hint based approach to buffer cache
management, which is in contrast to the UBM scheme
that requires no such hints. To compare the performance
of these two schemes, we used the ULTRIX multiple application
(postgres + cscope + linking the kernel) trace
in [4].


Figure 11 shows the buffer hit ratios of the
UBM scheme, two ACFC schemes, namely,
ACFC(HEURISTIC) and ACFC(RMIN), and the
LRU, 2Q, SEQ, EELRU, and OPT schemes when
the cache size increases from 4M to 16M. The
ACFC(HEURISTIC) scheme uses user-level hints for
each application, while the ACFC(RMIN) scheme uses
the optimal off-line strategy for each application. The
results for the two ACFC schemes were borrowed from
[4] while the results for all the other schemes were
obtained from simulations. The results show that the
hit ratios of the UBM scheme, which does not make
use of any external hints, are comparable to those of
the ACFC(RMIN) scheme and higher than those of the
ACFC(HEURISTIC) scheme.


Figure 11: Performance comparison of the UBM
scheme with the application-controlled file caching
scheme.


4.7 Warm/Cold Caching Effects 


All experiments so far were done with cold caches.
To evaluate the performance of the UBM scheme at
steady-state, we performed additional simulations with
warmed-up caches. In the experiments, we initially ran
a long-running workload (the sdet benchmark²) through the
buffer cache and, after the cache was warmed up, we
ran both the long-running workload and the target workload
(cscope + cpp + postgres) concurrently. The
cache statistics were collected only after the cache was
warmed up.


Experimental results that show the warm/cold caching
effects of the UBM scheme are given in Figure 12 and
Table 2. The graphs of Figure 12 show that when the
cache size is small, the performance improvements by
the UBM scheme with warmed-up caches over the LRU
scheme are similar to those with cold caches. As the
cache size increases, however, the performance increase
by the UBM scheme with warmed-up caches is reduced
when compared with the performance increase with cold
caches since sequential references are not allowed to be
cached at all. Specifically, when the cache is warmed
up, blocks belonging to sequential references are not
allowed to be cached. If those blocks are re-referenced
with a regular interval in the future (i.e., if they turn
into looping references), the UBM scheme has to reread
them from the disks. In cold caches, however, many
of them are reread from the cache partition that holds
sequentially-referenced blocks because when the cache
is cold, more than one block can be allocated to the
partition for the sequentially-referenced blocks.


² Sdet is the SPEC SDET benchmark that simulates a
multiprogramming environment.


The resulting performance degradation, however, is
not significant, as we can see from Table 2, which summarizes
the performance improvements by the UBM
scheme for both cold caches and warmed-up caches:
the difference in the average performance improvement
between cold caches and warmed-up caches is less than
1%. Although the overall performance degradation is
negligible, the additional misses may adversely affect 
the performance perceived by the user due to increased 
start-up time. As future work, we plan to explore an 
allocation scheme where blocks are allocated even for 
sequential references based on the probability that a se- 
quential reference will turn into a looping reference. 


5 Implementation of UBM in the FreeBSD 
Operating System 


The UBM scheme was integrated into the FreeBSD op- 
erating system running on a 133MHz Intel Pentium PC 
with 128MB RAM and a 1.6GB Quantum Fireball hard 
disk. For the experiments, we used five real applications
(cpp, cscope, glimpse, postgres, and mpeg_play) that
were explained in Section 4. We ran several combinations
of three or more of these applications concurrently
and measured the elapsed time of each application under
the UBM and SEQ schemes as well as under the built-in
LRU scheme when the cache sizes are 8MB, 12MB, and
16MB with the block size set to 8KB.


5.1 Performance Measurements 


Figure 13 shows the elapsed time of each individual
application under the UBM, SEQ, and LRU schemes. As
expected, the UBM scheme shows better performance
than the LRU scheme. In the figure, since the postgres
and cscope applications access large files repeatedly
with a regular interval, they show better improvement.
Overall, the UBM scheme reduces the elapsed time by
up to 67.2% (the elapsed time of the postgres application
for the cpp + postgres + cscope + mpeg_play case
with 16MB cache size) compared to the LRU scheme,
with an average improvement of 28.7%. We note that
improvements by the UBM scheme on the elapsed time
are comparable to those on the buffer hit ratios we
observed in the previous section.


Figure 13: Performance of the UBM scheme integrated
into FreeBSD.


There are two main benefits from using the UBM
scheme. The first is from detecting looping references,
managing them by a period-based replacement policy,
and allocating buffer space to them based on marginal
gains. The second is from giving preference to blocks
belonging to sequential references when a replacement
is needed. To quantify these benefits, we compared the
UBM scheme with the SEQ scheme. The results in Figure
13 show that there is still a substantial difference
in the elapsed time between the UBM scheme and the
SEQ scheme, indicating that the benefit from carefully
handling looping references is significant.


Table 2: Comparison of performance improvements of
the UBM scheme compared to the LRU scheme (Trace:
cscope + cpp + postgres + sdet).

| Improvements with warmed-up caches | Improvements with cold caches |
| avg. 18.9% (max. 22.6%)            | avg. 19.6% (max. 26.2%)       |


5.2 Effects of Run-Time Overhead of the UBM 
Scheme 


To measure the run-time overhead of the UBM 
scheme, we executed a CPU-bound application 
(cpu_bound) that executes an ADD instruction repeat- 
edly, along with the other applications. 


Figure 14 shows the run-time overhead of the UBM
scheme for two combinations of applications, namely,
cpu_bound + cpp + postgres + cscope and cpu_bound +
glimpse + cpp + postgres + cscope, when the cache
size is 12MB. The elapsed time of the cpu_bound application
increases slightly, by around 5%, when using the
UBM scheme compared with that when using the LRU
scheme. A major source of this overhead is
the operations to manipulate the ghost buffers and to
calculate the marginal gain values. So far, we have integrated
the UBM scheme into FreeBSD in a straightforward
manner without any optimization, and hence we
expect there is still much room for further optimizing
the performance.


Figure 14: Run-time overhead of the UBM scheme.


The UBM scheme also has the space overhead of maintaining
ghost buffers in the kernel memory. The maximum
size of the LRU stack to be maintained including
ghost buffers is limited by the total number of buffers
in the file system. Therefore, the space overhead due to
ghost buffers is proportional to the difference between
the total number of buffers in the system and the number
of allocated buffers for other references. In the current
implementation, each ghost buffer requires 13 bytes.
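
As a rough worked example of this space overhead, the sketch below assumes the worst case of one ghost buffer for every buffer not allocated to other references; the text only states proportionality, so that factor of one is an assumption, as are the function name and the sample split. The cache size corresponds to the 16MB configuration with 8KB blocks used in the experiments.

```python
GHOST_BUFFER_BYTES = 13  # per-ghost-buffer size reported in the text


def ghost_overhead_bytes(total_buffers, other_buffers):
    """Upper-bound estimate: one ghost buffer per buffer that is not
    currently allocated to the partition for other references."""
    if not 0 <= other_buffers <= total_buffers:
        raise ValueError("other_buffers must lie within the cache size")
    return GHOST_BUFFER_BYTES * (total_buffers - other_buffers)


if __name__ == "__main__":
    total = (16 * 1024) // 8  # 16MB cache, 8KB blocks -> 2048 buffers
    print(ghost_overhead_bytes(total, other_buffers=512))
```

Under these assumptions the ghost buffers occupy about 20KB, small next to the cache itself.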


6 Conclusions and Future Work 


This paper starts from the observation that the widely 
used LRU replacement scheme does not make use of 
regularities present in the reference patterns of applica- 
tions, leading to degraded performance. The Unified 
Buffer Management (UBM) scheme is proposed to re- 
solve this problem. The UBM scheme automatically 
detects sequential and looping references and stores the 
detected blocks in separate partitions of the buffer cache. 
These partitions are managed by appropriate replace- 
ment schemes based on the properties of their detected 
patterns. The allocation problem among the partitions 
is also tackled with the use of the notion of marginal 
gains. 


To evaluate the performance of the UBM scheme, ex- 
periments were conducted using both trace-driven sim- 
ulations with multiple application traces and an imple- 
mentation of the scheme in the FreeBSD operating sys- 
tem. Both simulation and implementation results show 
that 1) the UBM scheme accurately detects almost all the 
sequential and looping references, 2) the UBM scheme 
manages sequential and looping-referenced blocks sim- 
ilarly to the OPT scheme, and 3) the UBM scheme 
shows substantial performance improvements increas- 
ing the buffer hit ratio by up to 57.7% (with an average 
increase of 29.2%) and reducing, in an actual implemen- 
tation in the FreeBSD operating system, the elapsed time 
by up to 67.2% (with an average of 28.7%) compared 
to the LRU scheme, for the workloads we considered. 


As future research, we are attempting to apply to other
references the Least Recently/Frequently Used (LRFU)
scheme based on both recency and frequency factors
rather than the LRU scheme, which is based on the
recency factor only. Also, as automatic detection of
sequential and looping references is possible, we are
investigating the possibility of further enhancing performance
through prefetching techniques that exploit
these regularities as was attempted in [10] for informed
prefetching and caching. Finally, we plan to extend
the techniques presented in this paper to systems that
integrate virtual memory and file cache management.


Abstract


Some emerging applications require programs to main- 
tain sensitive state on untrusted hosts. This paper pre- 
sents the architecture and implementation of a trusted 
database system, TDB, which leverages a small amount 
of trusted storage to protect a scalable amount of un- 
trusted storage. The database is encrypted and validated 
against a collision-resistant hash kept in trusted storage, 
so untrusted programs cannot read the database or mod- 
ify it undetectably. TDB integrates encryption and hash- 
ing with a low-level data model, which protects data 
and metadata uniformly, unlike systems built on top of a 
conventional database system. The implementation ex- 
ploits synergies between hashing and log-structured 
storage. Preliminary performance results show that 
TDB outperforms an off-the-shelf embedded database 
system, thus supporting the suitability of the TDB archi- 
tecture. 


1 Introduction 


Some emerging applications require trusted programs to 
run on untrusted hosts. For example, vendors of digital 
goods such as software and music need to control the 
use of their goods according to their contracts with the 
consumers. The contracts may be enforced by executing 
a trusted program on the consumer’s computer or playing
device [SBV95, IBM00, Xer00].


Often, trusted programs need to maintain some sensi- 
tive, persistent state. For example, under a pay-per-use 
contract, the program may verify and debit the con- 
sumer’s account. Or, under a limited-use trial, the pro- 
gram may count and limit the number of times the good 
is used. The amount of such state may grow with the 
number of vendors, goods, and the types of contracts. 
Furthermore, the sensitive nature of the state makes it 
desirable to protect it from both tampering and acciden- 
tal corruption. Therefore, the state should be stored in a 
scalable and trusted database system. 


Although a trusted program runs on the client, it could
maintain its database on a trusted server for best security.
However, this may require frequent communication
between the trusted program and the server, which is
constraining for devices with poor connectivity. Ideally,
consumers should be able to use goods distributed on
mass media or previously hoarded on their devices,
even when they are disconnected from the network.
Therefore, it is desirable to maintain the database on the
client side.


The party hosting the database storage has the opportu- 
nity to alter its state for unauthorized benefits. For ex- 
ample, a consumer could save a copy of the local data- 
base, purchase some goods, then replay the saved copy, 
thus eliminating payments for the purchased goods. 


It is difficult to secure a trusted program and its data- 
base because the hosting party ultimately controls the 
underlying hardware and the operating system. How- 
ever, a number of emerging trusted platforms provide a 
processing environment that runs only trusted programs 
and resists reverse engineering and tampering. Such 
platforms employ a hardware package containing a
processor, memory, and tamper-detecting circuitry
[SPW98, KK99, Wav99, Dal00], or various techniques
for software protection [Coh93, Auc96, CTL98]. However,
these platforms do not provide trusted persistent
storage in bulk because it is difficult to prevent read and
write access to devices such as disk and flash memory
from outside the trusted platform.


This paper presents the architecture and implementation
of a trusted database system, TDB. By “trust” we mean
secrecy (protection against reading by untrusted programs)
and tamper detection (protection against writing
by untrusted programs). An untrusted program cannot
be prevented from tampering with the data, but such
data fails validation when a trusted program reads it.
This enables the trusted program to reject the data and
perhaps refuse further operation.


TDB may also be used to protect a database stored at an
untrusted server. Such a database may be used by client
devices that do not have enough local storage. In this
case, the user may have no incentive to tamper with the
client device, so no explicit mechanisms may be required
to provide a trusted platform on the client.


1.1 Basic Trust Management


TDB leverages a trusted processing environment and a 
small amount of trusted storage available on the plat- 
form. It provides secrecy by encrypting data with a key 
hidden in secret storage. It provides tamper detection by 
leveraging a small amount of tamper-resistant storage, 
as described below. 


A common mechanism for validating data is to sign it 
with a secret key. However, signed data is vulnerable to 
replay attacks. The attack is easy because it does not 
require understanding the data; it works even when the 
data is encrypted. TDB resists replay attack by storing a 
collision-resistant hash of the database in tamper- 
resistant storage [MOV96]. When a trusted program 
writes and reads database objects, TDB updates and 
validates the database hash efficiently by maintaining a 
tree of hash values over the objects, as suggested by 
Merkle [Mer80]. 
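
The hash-tree idea can be illustrated with a minimal Python sketch. It is not TDB's implementation: it assumes a fixed, power-of-two number of objects held in memory and uses SHA-1 from the standard hashlib module as the collision-resistant hash. Only the root value is assumed to live in tamper-resistant storage; the class name HashTree and its methods are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


class HashTree:
    """Binary hash tree; nodes[1] is the root, leaves sit at n..2n-1."""

    def __init__(self, objects):
        n = len(objects)
        if n == 0 or n & (n - 1):
            raise ValueError("object count must be a power of two")
        self.n = n
        self.nodes = [b""] * (2 * n)
        for i, obj in enumerate(objects):
            self.nodes[n + i] = h(obj)
        for i in range(n - 1, 0, -1):
            self.nodes[i] = h(self.nodes[2 * i] + self.nodes[2 * i + 1])

    def root(self) -> bytes:
        return self.nodes[1]

    def update(self, index, obj) -> bytes:
        """Rehash only the path from the changed leaf up to the root."""
        i = self.n + index
        self.nodes[i] = h(obj)
        i //= 2
        while i >= 1:
            self.nodes[i] = h(self.nodes[2 * i] + self.nodes[2 * i + 1])
            i //= 2
        return self.nodes[1]

    def validate(self, index, obj, trusted_root) -> bool:
        """Recompute the leaf-to-root path and compare with the trusted root.

        The stored sibling hashes are untrusted; a wrong sibling or a
        stale object simply yields a root that does not match.
        """
        i = self.n + index
        digest = h(obj)
        while i > 1:
            sibling = self.nodes[i ^ 1]
            digest = h(digest + sibling) if i % 2 == 0 else h(sibling + digest)
            i //= 2
        return digest == trusted_root


if __name__ == "__main__":
    tree = HashTree([b"a", b"b", b"c", b"d"])
    trusted = tree.update(2, b"z")                # new root goes to trusted storage
    print(tree.validate(2, b"z", trusted))        # current object validates
    print(tree.validate(2, b"c", trusted))        # replayed old object fails
```

Both update and validation touch a number of hashes proportional to the tree height rather than to the database size.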


TDB provides an option to use a tamper-resistant 
counter, which cannot be decremented, in place of ge- 
neric tamper-resistant storage. After each database up- 
date, TDB increments the counter and generates a cer- 
tificate containing the counter value and the database 
hash. The certificate is signed with the secret key and 
stored in untrusted storage. 
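
The counter-based option can be sketched as follows. The sketch makes several assumptions not stated in the text: the signature is modeled as an HMAC-SHA256 tag computed with the secret key, the counter is packed as a 64-bit integer, and a certificate is accepted only if its counter equals the current value of the tamper-resistant counter. The function names are hypothetical.

```python
import hashlib
import hmac
import struct

TAG_LEN = 32  # length of an HMAC-SHA256 tag


def make_certificate(secret_key: bytes, counter: int, db_hash: bytes) -> bytes:
    """Bind the counter value to the database hash under the secret key."""
    body = struct.pack(">Q", counter) + db_hash
    tag = hmac.new(secret_key, body, hashlib.sha256).digest()
    return body + tag


def check_certificate(secret_key: bytes, cert: bytes,
                      trusted_counter: int, db_hash: bytes) -> bool:
    """Accept only an authentic certificate for the current counter value."""
    body, tag = cert[:-TAG_LEN], cert[-TAG_LEN:]
    expected = hmac.new(secret_key, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return False
    counter = struct.unpack(">Q", body[:8])[0]
    return counter == trusted_counter and body[8:] == db_hash


if __name__ == "__main__":
    key = b"k" * 16
    old_hash = hashlib.sha1(b"database v1").digest()
    old_cert = make_certificate(key, 1, old_hash)
    # After an update the trusted counter reads 2, so replaying the old
    # database together with its old certificate is rejected.
    print(check_certificate(key, old_cert, 2, old_hash))
```

Because the counter can only move forward, a saved copy of an earlier database and certificate no longer matches the platform state.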


1.2 Storage Management 


To protect the state from accidental corruption, TDB 
provides standard database-system services such as 
crash atomicity, concurrent transactions, type checking, 
pickling, cache management, and index maintenance. 


One might consider building a trusted database system 
by layering cryptography on top of a conventional data- 
base system. This layer could encrypt objects before 
storing them in the database and maintain a tree of hash 
values over them. This architecture is attractive because 
it does not require building a new database system. Un- 
fortunately, the layer would not protect the metadata 
inside the database system. An attack could effectively 
delete an object by modifying the indexes. There could 
be some performance problems as well. For example, 
the database system could not maintain ordered indexes 
for range queries on encrypted data. 


For these reasons, TDB applies hashing and encryption 
to a low-level data model, which protects data and 
metadata uniformly. It also enables TDB to maintain 
ordered indexes on data. 


To protect the sensitive state from media failures such
as disk crashes, TDB provides the ability to create
backups and to restore valid backups. An attacker might
fake a media failure and restore a backup to roll back the
state. To limit the extent of a rollback, it is desirable to
make frequent backups and disallow restoring old backups.
TDB facilitates this by providing incremental
backups [HMF99].


We discovered and exploited the synergy between the 
functions mentioned above and log-structured storage 
systems [RO91]. Log-structured systems have a com- 
prehensive and hierarchical location map, because all 
objects are relocatable. Embedding the hash tree in the 
location map allows an object to be validated as it is 
located. The checkpointing optimization defers and 
consolidates the propagation of hash values up the tree. 
Copy-on-write using the location map provides cheap 
snapshots, which enables incremental backups. Fur- 
thermore, the absence of fixed object locations makes it 
hard to link multiple updates to the same object, thus 
resisting some traffic-monitoring attacks. 


Preliminary performance results show that TDB outper- 
forms a system that layers cryptography on top of an 
off-the-shelf database system. The database overhead is 
dominated by I/O; encryption and hashing represent 
only 6% of the total overhead. 


1.3 Outline 


The rest of this paper is organized as follows. Section 2 
specifies the infrastructure TDB requires and the ser- 
vice it provides. Section 3 describes the overall archi- 
tecture of TDB. Sections 4 and 5 describe the integra- 
tion of encryption and hashing in a low-level data 
model. Section 6 describes backup creation and re- 
stores. Sections 7 and 8 briefly describe the construc- 
tion of database functions over the low-level data 
model. Section 9 gives preliminary performance results. 
Section 10 describes potential extensions to TDB. Sec- 
tion 11 compares TDB with related work. Section 12 
draws some conclusions. 


2 System Specification 


This section specifies the infrastructure TDB requires 
and the service it provides to applications. 


2.1 Required Infrastructure 


TDB requires a trusted platform that provides the
following, as shown in Figure 1:

• Trusted processing environment, which executes only
trusted programs and protects the volatile state of an
executing program from being read or modified by
untrusted programs. The static image of a trusted
program need not be secret.

• Secret store: a small amount (e.g., 16 bytes) of read-only
persistent storage that can be read only by a
trusted program.

• Tamper-resistant store: a small amount (e.g., 16
bytes) of writable persistent storage that can be written
only by a trusted program. Alternatively, the tamper-resistant
store may be a counter that cannot be
decremented. In either case, we assume that the tamper-resistant
store can be updated atomically with respect
to crashes.


Figure 1: The trusted platform
[Diagram labels: authorized program, unauthorized program, trusted platform, processing environment, volatile memory, untrusted store.]


The trusted platform may be a hardware package such
as the IBM secure coprocessor [SPW98], which contains
a processor, battery-backed SRAM, DRAM, and
ROM. The ROM firmware loads only trusted programs 
using a hash supplied during the manufacturing process. 
The battery-backed SRAM is zeroed if tampering is 
detected, so it can serve as both secret and tamper- 
resistant store. 


The infrastructure also provides an untrusted store to 
hold the database. It is persistent, allows efficient ran- 
dom access, and can be read and written by any pro- 
gram. This might be a disk, flash memory, or an un- 
trusted storage server connected to the trusted platform. 


An archival store is needed to recover from the failures 
of the untrusted store. It is also untrusted. It need not 
provide efficient random access to data, only input and 
output streams. It might be a tape or an ftp server. We 
assume its failures are independent of the untrusted 
store. 


We assume that suitable steps are taken when tampering 
is detected. The exact nature of such steps is outside the 
scope of this paper. 


2.2 Service Provided


We list the functions of TDB below.


Trusted storage: TDB provides tamper-detection and 
secrecy for bulk data. This includes resistance to replay 
attacks and attacks on metadata. 


Partitions: An application may need to protect different 
types of data differently. For example, it may have no 
need to encrypt some data or to validate other data. 
TDB allows an application to create multiple logical 
partitions, each protecting data with its own crypto- 
graphic parameters: 

• a secret key

• a cipher (an encryption algorithm), e.g., 3DES

• a collision-resistant hash function, e.g., SHA-1


Using appropriate parameters avoids unnecessary time 
and space overhead. Using different secret keys reduces 
the loss from the disclosure of a single key. This should 
not be confused with access control among trusted par- 
ties, which may be provided in a higher layer, if needed. 


Atomic updates: TDB can update multiple pieces of 
data atomically with respect to fail-stop crashes such as 
power failures. 


Backups: TDB can back up a consistent snapshot of a 
set of partitions and restore a backup after validation. 
Backups allow recovery from media corruption. TDB 
provides fast incremental backups, which contain only 
changes made since a previous backup. 


Concurrent transactions: TDB provides serializable 
access to data from concurrent transactions. Unlike 
shared databases or file servers, TDB is not designed 
for simultaneous access by many users. Therefore, its 
concurrency control is geared to low concurrency. It 
employs techniques for reducing latency, but lacks so- 
phisticated techniques for sustaining throughput. 


Database size: TDB allows the database to scale with 
gradual performance degradation. It uses scalable data 
structures and fetches data piecemeal on demand. How- 
ever, it relies on a cacheable working set for perform- 
ance because its log-structured storage may destroy 
physical clustering. Another limitation is its no-steal 
buffering of dirty data, which does not scale to transac- 
tions with many modifications [GR93]. 


Objects: TDB stores abstract objects that the applica- 
tion can access without explicitly invoking encryption, 
validation, and pickling. TDB pickles objects using 
application-provided methods so the stored representa- 
tion is compact and portable. 


Collections and Indexes: TDB provides index maintenance
over collections of objects. A collection is a set
of objects that share one or more indexes. An index
provides scan, exact-match, and range iterators.


3 System Architecture 


TDB is designed for use on personal computers as well 
as smaller devices. The architecture is layered, so appli- 
cations can trade off functionality for smaller code size. 
In Figure 2, boxes represent modules and arrows represent
dependencies between them. Dashed boxes represent
infrastructural modules.


Figure 2: System architecture


The chunk store provides trusted storage for a set of 
named chunks. A chunk is a variable-sized sequence of 
bytes that is the unit of encryption and validation. (We 
expect chunk sizes between 100 bytes and 10 Kbytes.) 
All data and metadata from higher modules are stored 
as chunks. Chunks are logically grouped into partitions 
with separate cryptographic parameters. Partitions can 
be snapshot using the copy-on-write technique. 


Chunks are stored in the untrusted store. The chunk 
store supports atomic updates of multiple chunks in the 
presence of crashes. It hides logging and recovery from 
higher modules. This architecture does not support logi- 
cal logging, but the variable-sized chunks form a more 
compact log than fixed-sized pages.


The backup store creates and restores a set of partition
backups. The chunk store and the backup store encapsulate
secrecy and tamper-detection. This enables the
higher modules to provide database management without
worrying about trust.


The object store manages a set of named objects. It 
stores pickled objects in chunks—one or more objects 
per chunk. It keeps a cache of frequently-used or dirty 
objects. Caching data at this level is beneficial because 
the data is decrypted, validated, and unpickled. The 
object store also provides transactional access to
objects using read-write locking. 


The collection store manages a set of named collections 
of objects. It updates the indexes on a collection as 
needed. Collections and indexes are themselves repre- 
sented as objects. 


This paper focuses on integrating trust with storage 
management in the chunk store and the backup store. It 
describes higher modules briefly to show that the chunk 
store is able to support them, and to explain a high-level 
performance benchmark we use. 


4 Chunk Store: Single Partition 


To simplify presentation, this section describes the 
chunk store as it would be in the absence of multiple 
partitions. Section 5 describes multiple partitions. 


4.1 Specification 


The chunk store manages a set of chunks named with 
unique ids. It provides the following operations: 
• Allocate() returns ChunkId
  Returns an unallocated chunk id.
• Write(chunkId, bytes)
  Sets the state of chunkId to bytes, possibly of different
  size than the previous state. Signals if chunkId is
  not allocated.
• Read(chunkId) returns Bytes
  Returns the last written state of chunkId.
  Signals if chunkId is not written.
• Deallocate(chunkId)
  Deallocates chunkId.
  Signals if chunkId is not allocated.
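

The four operations can be pictured with a minimal in-memory sketch in Python. It is an illustration of the specification only, not of TDB's implementation: the class name `InMemoryChunkStore` and the exception `ChunkStoreError` are hypothetical, "signals" is modeled as raising an exception, and the sketch omits encryption, hashing, logging, and the reuse of deallocated ids.

```python
class ChunkStoreError(Exception):
    """Hypothetical error type standing in for 'signals' in the specification."""


class InMemoryChunkStore:
    """Toy model of the single-partition chunk store interface."""

    def __init__(self):
        self._next_id = 0
        # chunk id -> bytes, or None while the id is allocated but unwritten
        self._state = {}

    def allocate(self):
        """Return an unallocated chunk id (ids are never reused in this toy)."""
        chunk_id = self._next_id
        self._next_id += 1
        self._state[chunk_id] = None
        return chunk_id

    def write(self, chunk_id, data):
        """Set the state of chunk_id; the new state may have a different size."""
        if chunk_id not in self._state:
            raise ChunkStoreError("chunk id is not allocated")
        self._state[chunk_id] = bytes(data)

    def read(self, chunk_id):
        """Return the last written state of chunk_id."""
        data = self._state.get(chunk_id)
        if data is None:
            raise ChunkStoreError("chunk id is not written")
        return data

    def deallocate(self, chunk_id):
        """Release chunk_id and discard its state."""
        if chunk_id not in self._state:
            raise ChunkStoreError("chunk id is not allocated")
        del self._state[chunk_id]
```

A caller would allocate an id, write to it one or more times, read it back, and finally deallocate it; reading an id that was allocated but never written raises the error.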


Tamper Detection: In an idealized secret and tamper- 
proof chunk store, the operations listed above would be 
available only to trusted programs. Since tampering 
with the untrusted store cannot be prevented, the chunk 
store provides tamper-detection instead. It behaves like 
the tamper-proof store, except its operations may signal 
tamper detection if the untrusted store is tampered with.


Crash Atomicity and Durability: The write and deallocate
operations are special cases of a commit operation.
In general, a number of write and deallocate operations
may be grouped into a single commit, which is
atomic with respect to fail-stop crashes.


Allocated but unwritten chunks are deallocated auto- 
matically upon system restart. We have deliberately 
separated allocate and commit operations. An alterna- 
tive is to allocate ids when new, unnamed chunks are 
committed. However, this alternative does not allow an 
application to store a newly-allocated chunk id in an- 
other chunk during the same commit operation, which 
may be needed for data integrity. Systems that swizzle 
application-provided references into persistent ids upon 
commit do not face this problem. However, the chunk 
store does not interpret application data chunks. 


Concurrency Control: Operations are executed in a 
serializable manner. However, the chunk store is un- 
aware of transactions. Allocate, read, and commit op- 
erations from different transactions may be interleaved. 


4.2 Implementation Overview 


This section gives an overview of the implementation; 
subsequent sections give further detail. 


The chunk store writes chunks by appending them to a 
log in the untrusted store. As in other log-structured 
systems, chunks do not have static versions outside the 
log [RO91]. When a chunk is written or deallocated, its 
previous version in the log, if any, becomes obsolete. 


The chunk store uses a chunk map to locate and validate 
the current versions of chunks. To scale to a large num- 
ber of chunks, the chunk map is itself organized as a 
tree of chunks. Updates to the chunk map are buffered 
and written to the log occasionally. Updates lost upon a 
crash are recovered from the log. 


Secrecy is provided by encrypting chunks with the key 
in the secret store. Tamper-detection is provided by 
creating a path of hash links from the tamper-resistant 
store to every current chunk version. We say there is a 
hash link from data x to y if x contains a hash of some 
data that includes y. If x is linked to y via one or more 
links using a collision-resistant hash function, it is com- 
putationally hard to change y without changing x or 
breaking a hash link [Mer80]. The hash links are em- 
bedded in the chunk map and the log. 
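

The notion of a hash link can be made concrete with a two-link chain. The sketch below is only an illustration: the byte layout of `x` and the helper names are assumptions, and SHA-1 is used because the paper names it as an example hash function.

```python
import hashlib


def h(data: bytes) -> bytes:
    """Collision-resistant hash used for the links (SHA-1 in this sketch)."""
    return hashlib.sha1(data).digest()


# y is some chunk state; x contains a hash of y, so x is hash-linked to y.
y = b"current chunk version"
x = b"map chunk|" + h(y)

# The trusted value (kept in the tamper-resistant store) contains a hash of x.
trusted_root = h(x)


def validate(root: bytes, x_state: bytes, y_state: bytes) -> bool:
    """Follow both links: root -> x, then x -> y."""
    return h(x_state) == root and x_state.endswith(h(y_state))
```

Changing `y` without changing `x` breaks the second link, and changing `x` to match breaks the first, so any modification is detected unless the trusted root itself can be altered.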


Serializability of operations is provided through mutual 
exclusion, which does not overlap I/O and computation, 
but is simple and acceptable when concurrency is low. 


4.3 Chunk Map 


The chunk map maps a chunk id to a chunk descriptor, 
which contains the following information: 

e status of chunk id: unallocated, unwritten, or written 

e if written, current location in the untrusted store 

e if written, expected hash value of chunk 


Figure 3 shows the tree structure of the chunk map. The 
leaves are the chunks created by the applications of the 
chunk store; we call them data chunks. (These include 
chunks containing metadata of higher modules, for ex- 
ample, the indexing data of the collection store.) Each 
internal chunk, called a map chunk, stores a fixed-size 
vector of chunk descriptors. In the figure, each shaded 
slot is a chunk descriptor, and an arrow links the chunk 
containing the descriptor to the chunk described by the 
descriptor. The chunk at the top contains the descriptor 
of the root map chunk and some additional metadata 
needed to manage the tree; we call it the leader chunk. 
The descriptor of the leader chunk is retrieved at 
startup, as described later. The chunk store interprets 
map and leader chunks, but not data chunks. 


(Figure 3 diagram labels, from top to bottom: leader chunk,
map chunks, data chunks.)

Figure 3: The chunk map


For uniformity of access and storage management, non- 
data chunks are also named using chunk ids. The id of a 
chunk encodes its position in the tree. The position 
comprises the height of the chunk in the tree and its 
rank from the left among the chunks at that height. In 
the figure, chunk ids are denoted as “height.rank”. As
the tree grows, new chunks are added to the right and to 
the top, which preserves the positions of existing 
chunks. (The position of the leader does change, so it is 
given a reserved id instead.) Besides unifying access to 
chunks, this approach enables id-based navigation of 
the map without storing ids in the map explicitly. 
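

Id-based navigation can be sketched as simple arithmetic on positions. The sketch assumes a fixed number of descriptors per map chunk (`FANOUT`, a hypothetical value), heights counted from 0 at the data chunks, and ranks counted from 0 at the left; the paper does not state these conventions.

```python
FANOUT = 4  # assumed number of descriptors per map chunk


def parent(height: int, rank: int) -> tuple:
    """Position of the map chunk that holds the descriptor of (height, rank)."""
    return (height + 1, rank // FANOUT)


def slot_in_parent(rank: int) -> int:
    """Index of the descriptor within the parent's fixed-size vector."""
    return rank % FANOUT


def path_from_root(height: int, rank: int, root_height: int) -> list:
    """Positions visited when descending from the root map chunk to a chunk."""
    path = [(height, rank)]
    while path[-1][0] < root_height:
        path.append(parent(*path[-1]))
    return list(reversed(path))
```

Because a position depends only on height and rank, adding chunks to the right or adding a new level on top leaves the results for existing chunks unchanged.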


4.4 Allocate Operation 


Ids of deallocated data chunks are reused to keep the
chunk map compact and conserve id space. Deallocated
ids are linked through a free list embedded in the
descriptors. The head of the list is stored in the leader.


As mentioned, id allocation is not persistent until the
chunk is written (committed). Upon system restart,
chunk ids that were previously allocated but not written
are made available in the free list for re-allocation.
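

The reuse of deallocated ids through a free list can be sketched as follows. This is an illustration under assumptions: the dictionary `next_free` stands in for the link field embedded in each descriptor, `head` stands in for the list head stored in the leader, and `high_water` is a hypothetical counter for ids that have never been used.

```python
class IdAllocator:
    """Toy free list of chunk ids; not TDB's actual data layout."""

    def __init__(self):
        self.next_free = {}   # deallocated id -> next deallocated id
        self.head = None      # head of the free list (kept in the leader)
        self.high_water = 0   # smallest id that has never been allocated

    def allocate(self) -> int:
        # Prefer a previously deallocated id to keep the map compact.
        if self.head is not None:
            chunk_id = self.head
            self.head = self.next_free.pop(chunk_id)
            return chunk_id
        chunk_id = self.high_water
        self.high_water += 1
        return chunk_id

    def deallocate(self, chunk_id: int) -> None:
        # Push the id onto the front of the free list.
        self.next_free[chunk_id] = self.head
        self.head = chunk_id
```

In this sketch the most recently deallocated id is handed out first; the paper does not specify the order.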


4.5 Read Operation 


Given a chunk id c, its state may be located and vali- 
dated by traversing the path of descriptors from the 
leader to c. For each descriptor in the path, the chunk 
state is found as follows. The encrypted state is read 
from the location stored in the descriptor. It is de- 
crypted using the secret key. The decrypted state is 
hashed. If the computed hash does not match that stored 
in the descriptor, tamper detection is signaled. 


For better performance, the chunk map keeps a cache of 
descriptors indexed by chunk ids. Also, the leader 
chunk is pinned in the cache. The cached data is de- 
crypted, validated, and unpickled. 


If the descriptor for c is not in cache, the read operation 
looks for the descriptor of c’s parent chunk. Thus, the 
read operation proceeds bottom up until it finds a de- 
scriptor in the cache. Then it traverses the path back 
down to c, reading and validating each chunk in the 
path. This approach exploits the validated cache to 
avoid validating the entire path from the leader to the 
specified chunk. 
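

The bottom-up search followed by top-down validation might look like the sketch below. It rests on several assumptions that are not in the paper: chunk ids are `(height, rank)` tuples with a fixed `FANOUT`, a descriptor is a `(location, hash)` pair packed into 24 bytes, the untrusted store is a dictionary from location to bytes, and decryption is omitted so that only validation is shown.

```python
import hashlib
import struct

FANOUT = 2                      # assumed descriptors per map chunk
DESC = struct.Struct(">I20s")   # assumed layout: location, SHA-1 hash


class TamperDetected(Exception):
    """Raised when a chunk does not match the hash in its descriptor."""


def fetch(store, desc):
    """Read a chunk state from the untrusted store and validate it."""
    location, expected = desc
    state = store[location]
    if hashlib.sha1(state).digest() != expected:
        raise TamperDetected(location)
    return state


def read_chunk(store, cache, chunk_id, root_height):
    """Return the validated state of chunk_id, filling the descriptor cache."""
    # Bottom-up: climb until an ancestor's descriptor is found in the cache.
    path = [chunk_id]
    while path[-1] not in cache:
        height, rank = path[-1]
        if height >= root_height:
            raise KeyError("root descriptor must be pinned in the cache")
        path.append((height + 1, rank // FANOUT))
    # Top-down: validate each map chunk and extract the child's descriptor.
    node = path.pop()
    while path:
        map_state = fetch(store, cache[node])
        child = path.pop()
        offset = (child[1] % FANOUT) * DESC.size
        cache[child] = DESC.unpack_from(map_state, offset)
        node = child
    return fetch(store, cache[node])
```

Once a descriptor is cached, later reads of the same chunk skip the map chunks above it entirely, which is the saving the text describes.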


4.6 Commit Operation 


The commit operation hashes and encrypts each chunk 
to be written, and writes the encrypted state to the log in 
the untrusted store. We refer to the set of chunks written 
as the commit set. 


When a chunk c is written or deallocated, its descriptor 
is updated to reflect its new location, hash, or status. 
Conceptually, this changes c’s parent chunk d; if d were 
also written out, its descriptor would be updated, and so 
on up to the leader, whose descriptor would be written 
to the tamper-resistant store. Instead, to save time and 
log space, the chunk store updates c’s descriptor in 
cache and marks it as dirty so it is not evicted. The bot- 
tom-up search during reads ensures that the stale de- 
scriptor stored in d is not used. 


4.7 Checkpoint 


When the cache becomes too large because of dirty 
descriptors, all map chunks containing dirty descriptors 
and their ancestors up to the leader are written to the 
log. This is done as a special commit operation called a 
checkpoint. In practice, checkpoints happen infre- 
quently compared to regular commits. Other log- 
structured systems use similar checkpoints to defer and 
consolidate updates to the location map [RO91]. The
chunk store extends the optimization to propagating 
hash values up the chunk map. 


The leader is written last during a checkpoint. We refer 
to the part of the log written before the leader as the 
checkpointed log and the part including and after the 
leader as the residual log. Figure 4 shows a simple ex- 
ample, where the log tail contains some data chunks, 
possibly written in multiple commits, a checkpoint con- 
taining the affected map chunks and the leader chunk, 
and some more data chunks. Arrows link chunks as in 
Figure 3.


Figure 4: Checkpointing the chunk map


4.8 Recovery 


A crash loses buffered updates to the chunk map, but 
they are recovered upon system restart by rolling for- 
ward through the residual log. Section 4.9 describes 
how the log is represented so the recovery procedure 
may find the sequence of chunks in the residual log. 


For each chunk in the residual log, the recovery proce- 
dure computes the descriptor based on its location and 
hash, and puts the descriptor in the chunk-map cache. 
This procedure requires additional support from the 
commit operation to redo chunk deallocations and to 
validate the chunks in the residual log. This is described 
in the next two sections. 


4.8.1 Chunk Deallocation


For each chunk to be deallocated, the commit operation
writes a deallocate chunk to the log, which contains the
id of the deallocated chunk.


Deallocate chunks are instances of unnamed chunks: 
they do not have chunk ids or positions in the chunk 
map. This is acceptable because they are used solely for 
recovery from the residual log and are always obsolete 
in the checkpointed log. 


Like other chunks, unnamed chunks are encrypted with
the secret key. They are also protected against tampering,
as described in the next section. Otherwise, an
attack could cause a chunk to be un-deallocated. Or, an
attack could replay the deallocation of a chunk id after
it was re-allocated.


4.8.2 Validation of Residual Log


Although checkpointing defers the propagation of hash 
values up the chunk map, each commit operation must 
still update the tamper-resistant store to reflect the new 
state of the database. If the tamper-resistant store kept 
the hash of the leader and were updated only at check- 
points, the system would be unable to detect tampering 
with the residual log after a crash. We have imple- 
mented two approaches for maintaining up-to-date vali- 
dation information in the tamper-resistant store. 


4.8.2.1 Direct Hash Validation 


The chunk store maintains a sequential hash of the re- 
sidual log. The log hash is stored in the tamper-resistant 
store and updated after every commit. Upon recovery, 
the hash in the tamper-resistant store is matched against 
the hash computed over the residual log. This approach
is illustrated in Figure 5.


Figure 5: Tamper-resistant store contains database hash


A commit operation waits until the commit set is written 
to the untrusted store reliably before it updates the hash 
in the tamper-resistant store. Otherwise, a crash could 
leave the tamper-resistant store updated when the un- 
trusted store is not, and cause validation to fail upon 
recovery. The update to the tamper-resistant store is the 
real commit point: If there is a crash during this update, 
the previous value stored in the tamper-resistant store is 
recovered, and the last commit set in the untrusted store 
is ignored. The commit operation returns after the tam- 
per-resistant store is updated reliably. 
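

The ordering of the two writes can be sketched as follows. The chaining construction `H(previous hash || version)` is an assumption, since the paper says only that the hash is sequential; the list `untrusted_log` and the dictionary `tamper_resistant` are hypothetical stand-ins for the two stores, and encryption is omitted.

```python
import hashlib

EMPTY_HASH = b"\x00" * 20  # assumed initial value of the log hash


def extend_log_hash(log_hash: bytes, version: bytes) -> bytes:
    """Fold one chunk version into the sequential hash of the residual log."""
    return hashlib.sha1(log_hash + version).digest()


def hash_residual_log(versions) -> bytes:
    """Recompute the sequential hash from the head of the residual log."""
    log_hash = EMPTY_HASH
    for version in versions:
        log_hash = extend_log_hash(log_hash, version)
    return log_hash


def commit(untrusted_log, tamper_resistant, commit_set):
    """Write the commit set first, then update the trusted hash."""
    # Step 1: the commit set must be reliably in the untrusted store.
    untrusted_log.extend(commit_set)
    # Step 2: updating the trusted hash is the real commit point.
    log_hash = tamper_resistant["log_hash"]
    for version in commit_set:
        log_hash = extend_log_hash(log_hash, version)
    tamper_resistant["log_hash"] = log_hash


def recovery_validates(untrusted_log, tamper_resistant) -> bool:
    """Compare the recomputed hash with the one in the tamper-resistant store."""
    return hash_residual_log(untrusted_log) == tamper_resistant["log_hash"]
```

If a crash occurs between the two steps, recovery in a full system would ignore the trailing commit set that the trusted hash does not cover; this sketch shows only the validation check.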


Direct hash validation creates paths of hash links from 
the tamper-resistant store to all current chunk ver- 
sions—in both the residual log and the checkpointed 
log. This is true because the tamper-resistant store is 
directly linked to all chunks in the residual log, which
includes the leader from the last checkpoint, and the
leader is linked through the chunk map to all current 
chunk versions in the checkpointed log. Note that all 
unnamed chunks in the residual log are linked as well. 
Unnamed chunks in the checkpointed log are not linked, 
which is not a weakness because all such chunks are 
obsolete. 


4.8.2.2 Counter-Based Validation


In this approach, upon each commit, a sequential hash 
of the commit set is stored in an unnamed chunk added 
to the log, called the commit chunk. The commit chunk 
is signed with the secret key. (The signature need not be 
publicly verifiable, so it may be based on symmetric- 
key encryption [MOV96].) An attack cannot insert an 
arbitrary commit set into the residual log because it will 
be unable to create an appropriately signed commit 
chunk. Replays of old commit sets are resisted by add- 
ing a count to the commit chunk that is incremented 
after every commit. Deletion of commit sets at the tail 
of the log is resisted by storing the current commit 
count in the tamper-resistant store. This approach is 
illustrated in Figure 6.


Figure 6: Tamper-resistant store contains commit count


A checkpoint is followed by a commit chunk containing 
the hash of the leader chunk, as if the leader were the 
only chunk in the commit set. The recovery procedure 
checks that the hash of each commit set in the residual 
log matches that stored in the commit chunk, and that 
the counts stored in the commit chunks form a se- 
quence. Finally, the procedure compares the count in 
the last commit chunk with that in the tamper-resistant 
store. The hash-links created in this approach are simi- 
lar to those in direct hash validation, except that the 
commit chunks are signed and linked from the tamper- 
resistant store through a sequence of numbers. 
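

The recovery checks for counter-based validation can be sketched as follows. The sketch substitutes an HMAC for the symmetric-key signature, which is one possible instantiation and not necessarily the one TDB uses; the 8-byte count, the 28-byte chunk body, and the function names are all assumptions made for illustration.

```python
import hashlib
import hmac


def make_commit_chunk(key: bytes, count: int, commit_set) -> bytes:
    """Build a signed commit chunk: count, hash of the commit set, and a MAC."""
    set_hash = hashlib.sha1(b"".join(commit_set)).digest()
    body = count.to_bytes(8, "big") + set_hash
    tag = hmac.new(key, body, hashlib.sha1).digest()
    return body + tag


def check_residual_log(key, commits, first_count, trusted_count) -> bool:
    """Validate a residual log given as a list of (commit_set, commit_chunk)."""
    expected = first_count
    for commit_set, chunk in commits:
        body, tag = chunk[:28], chunk[28:]
        # Forged commit chunks fail the signature check.
        good_tag = hmac.new(key, body, hashlib.sha1).digest()
        if not hmac.compare_digest(tag, good_tag):
            return False
        # Replayed or reordered commit sets break the count sequence.
        if int.from_bytes(body[:8], "big") != expected:
            return False
        # A modified commit set no longer matches the stored hash.
        if body[8:] != hashlib.sha1(b"".join(commit_set)).digest():
            return False
        expected += 1
    # Deleted commit sets at the tail leave the last count behind the
    # count kept in the tamper-resistant store.
    return expected - 1 == trusted_count
```

The sketch hashes the plain concatenation of a commit set; in the log itself the headers carry sizes, so chunk boundaries are unambiguous there.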


Counter-based validation has several advantages. First, 
the tamper-resistant counter is a weaker requirement 
than a generic tamper-resistant store. Provided the 
counter cannot be decremented by any program, it does
not need additional protection against untrusted pro- 
grams. There is little incentive for untrusted programs 
to increment the counter because they would not be able 
to sign a commit chunk with the increased count. 


Second, the commit count allows the system to tolerate
bounded discrepancies between the tamper-resistant
store and the untrusted store, if desired. For example,
the system might allow the count in the tamper-resistant
store, t, to be a little behind the last count in the
untrusted store, u. This trades off security for performance.
The security risk is that an attack might delete
commit sets t+1 through u. The performance gain is that
a commit operation need not wait for updating the count
in the tamper-resistant store, provided (u − t) is smaller
than some threshold Δ_ut. This is useful if the
tamper-resistant store has high update latency. The system
might also allow t to leap ahead of u by another
threshold Δ_tu. This admits situations where the untrusted store
is written lazily (e.g., IDE disk controllers often flush
their cache lazily) and the tamper-resistant store might
be updated before the untrusted store. The only security
risk is the deletion of at most Δ_tu commit sets from the
tail of the log.
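

The tolerance check at recovery reduces to a comparison of the two counts. In the sketch below, `delta_ut` and `delta_tu` are our names for the two thresholds (how far u may run ahead of t, and how far t may run ahead of u), and the bounds are treated as inclusive, which is an assumption.

```python
def counts_acceptable(t: int, u: int, delta_ut: int = 0, delta_tu: int = 0) -> bool:
    """Accept the log if the last count u is within the allowed window of t.

    t: commit count in the tamper-resistant store
    u: count in the last valid commit chunk of the untrusted store
    """
    return -delta_tu <= u - t <= delta_ut
```

With both thresholds at zero this is the strict check `u == t` described earlier; larger thresholds widen the window and with it the number of commit sets an attack could delete undetected.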


A drawback of counter-based validation is that tamper 
detection relies on the secrecy of the key used to sign 
the commit chunk. Therefore, if a database system 
needed to provide tamper-detection but not secrecy, it 
would still need a secret store. 


4.9 Log Representation 


This section describes the structure of the data written 
to the log. The log consists of a sequence of chunks; we 
refer to the representation of a chunk in the log as a 
version. 


4.9.1 Chunk Versions


Chunk versions are read for three different functions:

• Read operation, which uses the chunk id and the
  descriptor to read the current version.

• Log cleaning, which reads a segment of the
  checkpointed log sequentially.

• Recovery, which reads the residual log sequentially.


To enable sequential reading, the log contains informa- 
tion to identify and demarcate chunks. Each chunk ver- 
sion comprises a header followed by a body. The header 
contains the chunk id and the size of the chunk state. 
The header of an unnamed chunk contains a reserved id. 
Both the header and the body are encrypted with the 
secret key. Similarly, the hash of the residual log or a 
commit set covers both headers and bodies. 
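

A header that carries the chunk id and the state size is enough to walk the log sequentially. The sketch below assumes a hypothetical layout (8-byte id, 4-byte size, big-endian) and a hypothetical reserved id for unnamed chunks, and it omits encryption of headers and bodies.

```python
import struct

HEADER = struct.Struct(">QI")        # assumed layout: chunk id, state size
UNNAMED_ID = 0xFFFFFFFFFFFFFFFF      # assumed reserved id for unnamed chunks


def encode_version(chunk_id: int, state: bytes) -> bytes:
    """Serialize one chunk version as header followed by body."""
    return HEADER.pack(chunk_id, len(state)) + state


def scan_versions(log: bytes):
    """Yield (chunk_id, state) for each version, reading the log in order."""
    offset = 0
    while offset + HEADER.size <= len(log):
        chunk_id, size = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        yield chunk_id, log[start:start + size]
        offset = start + size
```

Both the cleaner and the recovery procedure would consume such a scan; the read operation instead jumps directly to the location stored in the descriptor.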


4.9.2 Head of Residual Log 


The recovery procedure needs to locate the head and the 
tail of the residual log. The head of the residual log is 
the leader. Its location is stored in a fixed place, as in 
other log-structured storage systems. It need not be kept 
in tamper-resistant store: With direct hash validation, 
tampering with this state will change the computed hash 
of the residual log. With counter-based validation, it is 
possible for an attack to change the location to the be- 
ginning of another commit set. Therefore, the recovery 
procedure checks that the chunk at the stored location is 
the leader. 


Because the location of the leader is updated infre- 
quently—upon each checkpoint—storing it at a fixed 
location outside the log does not degrade performance. 
This location is written after the writes to the untrusted 
store and the tamper-resistant store have finished. Its 
update marks the completion of the checkpoint. If there 
is a crash before this update, the recovery procedure 
ignores the checkpoint at the tail of the log. 


4.9.3 Tail of Residual Log 


With direct hash validation, the location of the log tail 
may be stored in the tamper-resistant store along with 
the database hash. This works well because the write to 
the tamper-resistant store is the true commit point. 


With counter-based validation, it is possible to infer the 
location of the tail from the log itself, as in conventional 
databases [GR93]. The last commit set in the log may 
have been corrupted in a crash. The hash stored in a 
commit chunk serves well as a checksum for the commit 
set. The recovery procedure stops when the hash of a 
commit set does not match the hash stored in the com- 
mit chunk. 


4.9.4 Segments


The untrusted store is divided into fixed-size segments 
to aid cleaning, as in Sprite LFS [RO91]. The segment 
size is chosen for efficient reading and writing by the 
cleaner, e.g., on the order of 100 KB for disk-based 
storage. A segment is expected to contain many chunk 
versions. The size of a chunk version cannot exceed the 
segment size. A commit set may span multiple seg- 
ments. 


The log is represented as a sequence of potentially
non-adjacent segments. Since the recovery procedure needs
to read the residual log sequentially, segments in the
residual log contain an unnamed next-segment chunk at
the end, which contains the location of the next segment.


4.9.5 Log Cleaning


The log cleaner reclaims the storage of obsolete chunk 
versions and compacts the storage to create empty seg- 
ments. It selects a segment to clean and determines 
whether each chunk version is current by using the 
chunk id in the header to find the current location in the 
chunk map. It then commits the set of current chunks, 
which rewrites them to the end of the log [BHS95]. 
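

The currency test performed by the cleaner can be sketched in a few lines. The representation is assumed for illustration: a segment is a list of `(chunk_id, location, state)` triples, `chunk_map` maps a chunk id to the location of its current version, and `commit` is any callable that rewrites the given chunks at the end of the log.

```python
def clean_segment(segment_versions, chunk_map, commit):
    """Rewrite the current versions found in a segment; return their ids."""
    current = [
        (chunk_id, state)
        for chunk_id, location, state in segment_versions
        # A version is current only if the map still points at this location.
        if chunk_map.get(chunk_id) == location
    ]
    if current:
        commit(current)
    return [chunk_id for chunk_id, _ in current]
```

After the commit, every version in the segment is obsolete and the segment can be reused.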


The set of steps from selecting a segment to committing 
the current chunks happens atomically with respect to 
externally invoked operations. The cleaner may be in- 
voked synchronously when space is low, but it is mostly 
invoked asynchronously during idle periods. 


The cleaner does not clean segments in the residual log, 
because that would destroy the sequencing of the resid- 
ual log. This also resolves what the cleaner should do 
with unnamed chunks, because they are always obsolete 
in the checkpointed log. For performance reasons, the 
cleaner selects segments with low utilization. Details on 
the utilization metric and the maintenance of this infor- 
mation are beyond the scope of this paper. 


The cleaner need not validate the chunks read from the 
segment provided the commit operation for rewriting 
current chunks does not update the hash values stored in 
chunk descriptors. If the hashes are recomputed and 
updated, as they would be in a regular commit, the 
cleaner must validate the current chunks; otherwise, the 
cleaner might launder chunks modified by an attack. 
Because of its simplicity, we have implemented the sec- 
ond, less efficient, approach. 


5 Chunk Store: Multiple Partitions 


This section describes extensions to the chunk store that 
provide multiple partitions and partition copies. Multi- 
ple partitions enable the use of different cryptographic 
parameters for different types of data. Partition copies 
enable fast backups. 


5.1 Specification 


The chunk store manages a set of named partitions, each 
containing a set of named chunks. A chunk id comprises 
the chunk position, as before, and the id of the contain- 
ing partition. (A chunk in one partition may have the 
same position as another chunk in another partition.) 
The chunks in a partition are protected with the
parameters associated with it.


The following partition operations are provided:

• Allocate() returns PartitionId
  Returns an unallocated partition id.

• Write(partitionId, secretKey, cipher, hashFunction)
  Sets the state of partitionId to an empty partition with
  the specified cryptographic parameters.

• Write(partitionId, sourcePId)
  Copies the current state of sourcePId to partitionId.
  Each chunk in sourcePId is logically duplicated in
  partitionId at the same position.

• Diff(oldPId, newPId) returns set<ChunkPosition>
  Returns a set containing chunk positions whose state
  is different in newPId and oldPId.

• Deallocate(partitionId)
  Deallocates partitionId and all of its copies, and all
  chunks in these partitions.


Furthermore, the chunk allocate operation requires the 
id of the partition in which the chunk is to be created. A 
commit operation may include a number of write and 
deallocate operations on both partitions and chunks. 
This makes it possible, for example, to store the id of a 
newly-written partition into a chunk in an existing parti- 
tion in one atomic step. 


The next few sections describe how the extended speci- 
fication is implemented. 


5.2 Multi-partition Chunk Map


Figure 7 shows the structure of the multi-partition chunk 
map. Each written partition has a position map, which 
maps a chunk position in the partition to a descriptor. 
This map is like the single-partition map described in 
Section 4.3. The map chunks in the position map of 
partition P belong to P: their partition id is P and they 
are protected using P’s cryptographic parameters. In the 
figure, chunk ids are denoted as “partition:position”.


(Figure 7 diagram labels: system leader, partition leader,
Partition 1, Partition 2, Partition 3.)

Figure 7: Multi-partition chunk map


The leader chunk for a partition contains information
needed to manage the position map, as before, and the 
cryptographic parameters of the partition, including the 
secret key. The partition map at the top maps a partition 
id to the partition leader. This map is managed like the 
position map of a special partition, called the system 
partition, which has a reserved id denoted S in the fig- 
ure. The partition leaders are the data chunks of the 
system partition and are protected using the crypto- 
graphic parameters of the system partition. Many parti- 
tion operations such as allocating a partition id or read- 
ing a partition leader translate into chunk-level opera- 
tions on the system partition. 


Chunks in the system partition and the system leader are 
protected using a fixed cipher and hash function that are 
considered secure, such as 3DES and SHA-1 [MOV96]. 
They are encrypted with the key in the secret store. 
Thus, secrecy is provided by creating a path of cipher
links from the secret store to every current chunk
version. We say that there is a cipher link from one piece
of data to another if the second is encrypted using a key 
stored in the first. 


5.3 Partition Copies and Diffs 


To copy a partition P to Q, the chunk store copies the
contents of P’s leader to Q’s leader. Thus, Q and P
share both map and data chunks, and Q inherits the
cryptographic parameters of P. Partition copies are
therefore cheap in space and time.


When chunks in P are updated, the position map for P 
is updated, but that for Q continues to point to the 
chunk versions at the time of copying. The chunks of Q 
can also be modified independently of P, but the com- 
mon use is to create a read-only copy, called a snapshot. 


The chunk store diffs two partitions by traversing their 
position maps and comparing the descriptors of the cor- 
responding chunks. Commonly, diffs are performed 
between two snapshots of the same partition. 
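

Copy-on-write copies and descriptor-based diffs can be illustrated with a toy model. Here each partition's position map is a Python dictionary from position to descriptor, `leaders` maps a partition id to its map, and an update replaces the map rather than mutating it so that copies keep seeing the old version. These names and the flat-dictionary representation are assumptions; the real position map is a tree, whose shared subtrees a diff could skip.

```python
def copy_partition(leaders, source, target):
    """Copy a partition by sharing its position map (no chunks are copied)."""
    leaders[target] = leaders[source]


def write_chunk(leaders, pid, position, descriptor):
    """Update one descriptor without disturbing partitions that share the map."""
    new_map = dict(leaders[pid])
    new_map[position] = descriptor
    leaders[pid] = new_map


def diff(leaders, old_pid, new_pid):
    """Return the positions whose descriptors differ between two partitions."""
    old_map, new_map = leaders[old_pid], leaders[new_pid]
    return {
        position
        for position in old_map.keys() | new_map.keys()
        if old_map.get(position) != new_map.get(position)
    }
```

Taking a snapshot Q of P, updating P, and then computing `diff(leaders, "Q", "P")` yields exactly the positions an incremental backup would need to include.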


5.4 Log Representation 


A commit set may contain chunks from different parti- 
tions. A chunk body is encrypted with the secret key 
and cipher of its partition. However, chunk headers are 
encrypted with the system key and cipher, so that clean- 
ing and recovery may decrypt the header without know- 
ing the partition id of the chunk. 


The system leader is the head of the residual log, so it is 
linked from the tamper-resistant store. The residual log 
is hashed using the system hash function. Thus, each 
chunk in a commit set is hashed twice: once with its 
partition-specific hash function to update the chunk 
descriptor, and once with the system hash function to 
update the log hash. In principle, the log hash could be 
computed over the partition-specific hashes of chunk 
bodies. However, a weak partition hash function could 
then invalidate the use of the log hash as a checksum for 
recovery (see Section 5.4). For simplicity, and because 
hashing is relatively fast, we chose to keep the hashes 
separate. 
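
The double hashing can be illustrated with Python's `hashlib`. This is a simplified sketch: the function name is hypothetical, the hash names are placeholders, and the log hash here covers only plaintext chunk bodies, whereas the real log also contains encrypted headers.

```python
import hashlib


def hash_commit_set(chunks, partition_hash="sha256", system_hash="sha1"):
    """Hash each chunk twice: once per partition, once for the log."""
    log_hash = hashlib.new(system_hash)
    descriptors = {}
    for chunk_id, body in chunks:
        # Partition-specific hash, stored in the chunk descriptor.
        descriptors[chunk_id] = hashlib.new(partition_hash, body).hexdigest()
        # System hash, accumulated over the whole commit set.
        log_hash.update(body)
    return descriptors, log_hash.hexdigest()


descriptors, log_digest = hash_commit_set([(1, b"abc"), (2, b"def")])
print(descriptors[1], log_digest)
```

Because the log hash is computed directly over the data with the system hash function, its strength does not depend on the hash function chosen for any partition.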


5.5 Cleaning and Recovery 


Checking whether a chunk version is current is compli- 
cated by partition copies. A chunk header contains the 
id of the partition P to which it belonged when the 
chunk was written. Even if the version is obsolete in P, 
it may be current in some direct or indirect copy of P. 
Therefore, each partition leader stores the ids of its di- 
rect copies and the cleaner checks for current-ness in 
the copies, recursively. The process would be more 
complex had it not been that the deallocation of a parti- 
tion deallocates the partition’s copies as well. 
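
The recursive check can be sketched as below, assuming each partition record holds its position map and the ids of its direct copies. The data layout and the name `current_in` are illustrative assumptions.

```python
def current_in(partitions, pid, chunk_id, version):
    """Return ids of partitions (pid or its copies) where version is current."""
    found = []
    part = partitions[pid]
    if part["map"].get(chunk_id) == version:
        found.append(pid)
    for copy_id in part["copies"]:
        # Direct copies are checked recursively, covering indirect copies.
        found.extend(current_in(partitions, copy_id, chunk_id, version))
    return found


partitions = {
    "P": {"map": {"x": "v2"}, "copies": ["Q"]},
    "Q": {"map": {"x": "v1"}, "copies": ["R"]},
    "R": {"map": {"x": "v1"}, "copies": []},
}
holders = current_in(partitions, "P", "x", "v1")
print(holders)
```

A version may be reclaimed only when the returned list is empty. In this example the version written as P:x is obsolete in P but still current in Q and R.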


Suppose the cleaner rewrites a chunk version identified 
as P:x that is current only in partitions Q and R. The 
commit procedure updates the descriptors for Q:x and 
R:x in the cache. Further, in order that the recovery pro- 
cedure is able to identify the chunk correctly, the 
cleaner appends an unnamed cleaner chunk, which 
specifies that the chunk is current in both Q and R. 


6 Backup Store 


The backup store creates and restores backup sets. A 
backup set consists of one or more partition backups. 
The backup store creates backup sets by streaming 
backups of individual partitions to the archival store and 
restores them by replacing partitions with the backups 
read from the archival store. 


6.1 Backup Consistency 


The backup store guarantees consistency of backup 
creation and restore with respect to other chunk store 
operations. Instead of locking each partition for the en- 
tire duration of backup creation, the backup store cre- 
ates a consistent snapshot of the source partitions using 
a single commit operation. It then copies the snapshots 
to archival storage in the background. We assume that 
restores are infrequent, so it is acceptable to stop all 
other activity while a restore is in progress. 


6.2 Backup Representation 


Partition backups may be full or incremental. A full 
partition backup contains all data chunks of the parti- 
tion. An incremental backup of a partition is created 
with respect to a previous snapshot, the base, and con- 
tains the data chunks that were created, updated, or de- 
allocated since the base snapshot. Backups do not con- 
tain map chunks since chunk locations in the untrusted 
store are not needed. Chunks in a backup are repre- 
sented like chunk versions in the log. 


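[Figure 8 diagram. Timeline of partition P with labels: created empty, base snapshot (partition Q), new snapshot (partition R), and current state (partition P). The incremental backup spans from the base snapshot to the new snapshot; the full backup spans from creation to the new snapshot.]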


Figure 8: Full and incremental backups 


A partition backup contains a backup descriptor, a se- 
quence of chunk versions, and a backup signature. The 
backup descriptor contains the following (illustrated 
using partition ids from Figure 8): 

• id of source partition (P) 

• id of partition snapshot used for this backup (R) 

• id of base partition snapshot (Q, if incremental) 

• backup set id (a random number assigned to the set) 

• number of partition backups in the backup set 

• partition cipher and hasher 

• time of backup creation 


The representation of partition backups is illustrated 
below. Here, H_s denotes the system hash function, H_p 
denotes the partition hash function, E_s denotes the system 
cipher using the system key, and E_p denotes the parti- 
tion cipher using the partition key. 

PartitionBackup ::= 
    E_s(BackupDescriptor) 
    ( E_s(ChunkHeader) E_p(ChunkBody) )* 
    BackupSignature 
    Checksum 

BackupSignature ::= 
    E_s(H_s(BackupDescriptor H_p((ChunkId ChunkBody)*))) 


The backup signature binds the backup descriptor with 
the chunks in the backup and guarantees integrity of the 
partition backup. The unencrypted checksum allows an 
external application to verify that the backup was writ- 
ten completely and successfully. 
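
The nesting of hashes in the signature can be sketched with `hashlib`. This is only an illustration under stated assumptions: the function name is hypothetical, the outer encryption step is omitted, chunk ids are encoded as 8-byte integers, and SHA-1 is used for both hash functions.

```python
import hashlib


def backup_digest(descriptor, chunks, hp="sha1", hs="sha1"):
    """Compute H_s(descriptor || H_p(sequence of chunk id and body))."""
    inner = hashlib.new(hp)                    # partition hash over chunks
    for chunk_id, body in chunks:
        inner.update(chunk_id.to_bytes(8, "big"))
        inner.update(body)
    outer = hashlib.new(hs)                    # system hash binds descriptor
    outer.update(descriptor)
    outer.update(inner.digest())
    return outer.digest()


chunks = [(1, b"first body"), (2, b"second body")]
good = backup_digest(b"descriptor-P-R-Q", chunks)
tampered = backup_digest(b"descriptor-P-R-Q", [(1, b"first body"), (2, b"evil")])
print(good != tampered)
```

Changing either the descriptor or any chunk changes the digest, which is how the signature binds the two together. A real encoding would also need to delimit chunk bodies unambiguously.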


6.3 Backup Restore 


The backup store restores a backup by reading a stream 
of one or more backup sets from the archival store. The 
backup store restores one partition at a time, enforcing 
the following constraints: 

• Incremental backups are restored in the same order as 
they were created, with no missing links in between. 
This is enforced by matching the base partition id in 
the backup descriptor against the id of the previous 
restored snapshot for the same partition. 

• If a partition backup is restored, the remaining parti- 
tion backups in the same backup set must also be re- 
stored. This is enforced by matching the number of 
backups with a given set id against the set size re- 
corded in backup descriptors. 


After reading the entire backup stream, the restored 
partitions are atomically committed to the chunk store. 
Backup restores require approval from a trusted pro- 
gram, which may deny frequent restoring or restoring of 
old backups. 
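
The two restore constraints can be checked over a stream of backup descriptors roughly as follows. The dictionary keys and the function name are assumptions made for this sketch, and the approval step by a trusted program is not modeled.

```python
def check_restore(stream, restored=None):
    """Validate descriptors in stream order; return last snapshot per source."""
    last = dict(restored or {})   # source partition id -> last restored snapshot
    seen, sizes = {}, {}
    for d in stream:
        # Constraint 1: an incremental backup must follow its base snapshot.
        if d["base"] is not None and last.get(d["source"]) != d["base"]:
            raise ValueError("missing link before snapshot %s" % d["snapshot"])
        last[d["source"]] = d["snapshot"]
        seen[d["set_id"]] = seen.get(d["set_id"], 0) + 1
        sizes[d["set_id"]] = d["set_size"]
    # Constraint 2: every backup set must be restored completely.
    for set_id, count in seen.items():
        if count != sizes[set_id]:
            raise ValueError("incomplete backup set %s" % set_id)
    return last


full = {"source": "P", "snapshot": "Q", "base": None, "set_id": 7, "set_size": 1}
incr = {"source": "P", "snapshot": "R", "base": "Q", "set_id": 8, "set_size": 1}
state = check_restore([full, incr])
print(state)
```

Only after the whole stream passes both checks would the restored partitions be committed atomically.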


7 Object Store 


The object store adds safety against errors in applica- 
tion programs. It provides type-safe and transactional 
access to a set of objects. An object is the unit of typed 
data accessed by the application. The object store im- 
plements two-phase locking on objects and breaks dead- 
locks using timeouts. Transactions acquire locks in ei- 
ther shared or exclusive mode. We chose not to imple- 
ment granular or operation-level locks because we 
expect only a few concurrent transactions. The object 
store keeps a cache of frequently-used or dirty objects. 
Caching data at this level is beneficial because the data 
is decrypted, validated, and unpickled. 


The object store could store one or more pickled objects 
in each chunk. We chose to store each object in a dif- 
ferent chunk because it results in a smaller volume of 
data that must be encrypted, hashed, and written to the 
log upon a commit. In addition, the implementation of 
the cache is simplified since no chunk can contain both 
committed and uncommitted objects. On the other hand, 
storing each object in a different chunk destroys inter- 
object clustering and increases the database size due to 
per-chunk overhead (see Section 9.3). Because we ex- 
pect much of the working set to be cached, the lack of 
inter-object clustering is not important. 


8 Collection Store 


The collection store provides applications with indexes 
on collections of objects. A collection is a set of objects 
sharing one or more indexes. Indexes can be dynami- 
cally added and removed from each collection. Collec- 
tions and indexes are themselves represented as objects. 


The collection store supports functional indexes that use 
keys extracted from objects by deterministic functions 
[Hwa94]. The use of functional indexes allows us to 
avoid a separate data definition language for the data- 
base schema. Indexes are maintained automatically as 
objects are updated. Indexes may be unsorted or sorted, 
which is possible because the objects are decrypted. 
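
A sorted functional index can be sketched in Python as below. The `FunctionalIndex` class is a hypothetical illustration, not the collection store's interface. The key is computed from each object by a deterministic function, so no separate schema declaration is needed.

```python
import bisect


class FunctionalIndex:
    """Sorted index whose keys are computed from objects by a function."""

    def __init__(self, key_fn):
        self.key_fn = key_fn      # deterministic key extractor
        self._entries = []        # sorted list of (key, object id)

    def insert(self, oid, obj):
        bisect.insort(self._entries, (self.key_fn(obj), oid))

    def remove(self, oid, obj):
        self._entries.remove((self.key_fn(obj), oid))

    def update(self, oid, old_obj, new_obj):
        # Index maintenance on update: drop the old key, add the new one.
        self.remove(oid, old_obj)
        self.insert(oid, new_obj)

    def range(self, lo, hi):
        """Object ids with lo <= key <= hi, in key order."""
        start = bisect.bisect_left(self._entries, (lo,))
        result = []
        for key, oid in self._entries[start:]:
            if key > hi:
                break
            result.append(oid)
        return result


by_price = FunctionalIndex(lambda obj: obj["price"])
goods = {1: {"price": 5}, 2: {"price": 3}, 3: {"price": 9}}
for oid, obj in goods.items():
    by_price.insert(oid, obj)
cheap = by_price.range(3, 5)
print(cheap)
```

Range queries like this one rely on comparing plaintext keys, which is why sorted indexes are possible only once the objects are decrypted.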


9 Performance 


In this section we describe preliminary performance 
measurements. First, we present the performance on 
chunk and backup store operations based on several 
micro-benchmarks. Then we compare the performance 
of an off-the-shelf database system and TDB using a 
higher-level benchmark. 


9.1 Platform 


Performance was evaluated on a 450 MHz Pentium PC 
with 128 MB of RAM, running the Windows NT 4.0 
operating system. TDB is written in C++. 


The untrusted store was implemented as an NTFS file 
on a hard disk with 9 ms average seek and 7200 rpm (4 
ms average rotational latency). Using a raw disk parti- 
tion would be more efficient, but we do not expect the 
users of TDB to provide one. The total size of TDB 
caches (including the object cache and the chunk-map 
cache) was set to 4 Mbytes. 


The tamper-resistant store was emulated with an NTFS 
file on another hard disk to avoid interference with ac- 
cesses to the untrusted store. This disk has 12 ms aver- 
age seek and 5200 rpm (6 ms average rotational la- 
tency). The access time is similar to that for writing 
EEPROM, 5 ms [Inf00]. 


We used counter-based validation and allowed the 
count in the tamper-resistant store to lag behind that in 
the untrusted store by Δ_ut = 5. The tamper-resistant store is 
flushed only once in Δ_ut commits. The untrusted store is 
flushed upon every commit and we set Δ_tu to 0. 


9.2 Micro-benchmarks 


This section presents the performance of basic crypto- 
graphic, disk, chunk store and backup store operations. 


9.2.1 Cryptographic and Disk Operations 


Encryption: We used 3DES in CBC mode for the sys- 
tem partition, which has a measured bandwidth of 2.5 
MB/s (0.4 µs per byte). We used DES in CBC mode for 
other partitions; the measured bandwidth is 7.2 MB/s 
(0.14 µs per byte). There are other, more secure, algo- 
rithms that run faster than DES [MOV96]. 


Hashing: We used SHA-1. The measured bandwidth is 
21.1 MB/s (0.05 µs per byte). Additionally, the “final- 
ization” of a hash value has a fixed overhead of 5 µs. 


Store latency: While the disk specs provide average 
latency, the measured latency varies widely based on 
the position of disk head. Furthermore, the latency of 
the NTFS flush operation for files larger than 512 bytes 
is doubled because it writes file metadata separately. 
We measured write latencies of 10 ms to 20 ms for 
small files and 25 ms to 40 ms otherwise. Therefore, we 
shall focus on the computational overhead and denote 
the latencies of the untrusted and tamper-resistant store 
symbolically as l_u and l_t. 


Store bandwidth: The measured bandwidth, b_u, of 
reading or writing the NTFS file implementing the un- 
trusted store varies between 3.5 and 4.7 MB/s. 


9.2.2 Chunk Store Operations 


We repeated each operation 10 times and found that the 
computational overhead does not vary much, typically 
deviating less than 2%. 


Allocate chunk id: This operation does not change the 
persistent state. The average latency is 6 µs. 


Write chunks + commit: We committed sets of 1 to 
128 chunks of sizes 128 bytes to 16 KB per chunk, 
which covers the range we expect. The computational 
latency, measured using linear regression, is 132 µs + 
36 µs per chunk + 0.24 µs per byte of cumulative chunk 
size. The fixed overhead comes largely from processing 
the commit chunk (pickling, encrypting, hashing, etc.), 
the per-chunk overhead from processing the chunk 
header and finalizing the chunk’s hash value, and the 
per-byte overhead from encryption and hashing the 
chunk bodies. The I/O overhead is l_u + l_t/Δ_ut + 1/b_u per 
byte, which usually dominates the computational over- 
head. 
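
The cost model for a commit can be evaluated directly. The sketch below plugs in the regression coefficients reported above. The function names are hypothetical, and the store latencies (15 ms each) and bandwidth (4 bytes per µs) are assumed values chosen within the measured ranges, with the tamper-resistant latency amortized over the 5 commits between flushes.

```python
def commit_cpu_us(n_chunks, total_bytes):
    """Computational latency of a commit in microseconds."""
    return 132 + 36 * n_chunks + 0.24 * total_bytes


def commit_io_us(total_bytes, l_u_us, l_t_us, flush_interval, b_u_bytes_per_us):
    """I/O latency: untrusted flush, amortized trusted flush, and transfer."""
    return l_u_us + l_t_us / flush_interval + total_bytes / b_u_bytes_per_us


# Example: a commit set of 16 chunks of 512 bytes each.
n, size = 16, 16 * 512
cpu = commit_cpu_us(n, size)
io = commit_io_us(size, 15000, 15000, 5, 4.0)
print(round(cpu, 2), io)
```

Under these assumptions the computational cost is under 3 ms while the I/O cost is about 20 ms, which illustrates why I/O usually dominates.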


Read chunk: If the chunk descriptor is cached, the 
computational latency of reading a chunk is 47 µs + 
0.18 µs per byte of chunk size. The fixed overhead 
comes largely from processing the chunk header and 
finalizing the hash, and the per-byte overhead from de- 
cryption and hashing. The I/O overhead is l_u + 1/b_u per 
byte. If the descriptor is not cached, the read operation 
reads in parental map chunks up to one whose descrip- 
tor is cached. In our experiments, each map chunk has 
64 descriptors and has a size of 1.5 KB. 


Write partition + commit: The computational latency 
of committing a new partition is 223 µs. The computa- 
tional latency of copying a partition is 386 µs, regard- 
less of the number of chunks in the source partition, 
owing to our use of the copy-on-write technique. 


9.2.3 Backup Store Operations 


We benchmarked only backup creation because we assume that 
backup restore performance is not critical. 


Partition backup: We used 512 byte chunks. The 
computational latency to create an incremental backup 
of a partition is 675 µs + 9 µs per chunk in the backed 
up partition + 278 µs per updated chunk. The fixed 
overhead comes mostly from creating the partition 
snapshot and processing the backup descriptor and sig- 
nature. The overhead per chunk in the backed up parti- 
tion comes from diff-ing the snapshot of the backed up 
partition against the base snapshot. The overhead per 
updated chunk comes from copying the chunk. 


The size of a backup determines the I/O overhead for 
writing it. The size of an incremental backup is 456 B + 
528 B per updated chunk, which may be significantly 
less than the size of a full backup. 
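
As a worked example of these formulas, the sketch below evaluates the cost and size of an incremental backup. The function names and the partition of 10,000 chunks with 100 updates are hypothetical, and applying the per-chunk size term to every chunk to estimate a full backup is an assumption.

```python
def incremental_backup_cpu_us(chunks_in_partition, updated_chunks):
    """Computational latency in microseconds, for 512-byte chunks."""
    return 675 + 9 * chunks_in_partition + 278 * updated_chunks


def backup_size_bytes(chunks_written):
    """Backup size in bytes: fixed part plus 528 B per chunk written."""
    return 456 + 528 * chunks_written


cpu_us = incremental_backup_cpu_us(10_000, 100)
incr_size = backup_size_bytes(100)
full_size_estimate = backup_size_bytes(10_000)   # assumed: all chunks written
print(cpu_us, incr_size, full_size_estimate)
```

With 1% of the chunks updated, the incremental backup is roughly one hundredth the estimated size of a full backup, while the diff term still scans every chunk in the partition.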


9.3 Space Overhead 


The chunk descriptor, header, and padding add an over- 
head of about 52 bytes for chunks encrypted using an 8- 
byte block cipher. The additional overhead per chunk 
due to the chunk map is small because the fanout degree 
of the tree is large (64). Obsolete chunk versions in the 
log add additional overhead. When cleaning in idle pe- 
riods, the space utilization may be kept as high as 90% 
with reasonable performance [BHS95]. 


9.4 Code Complexity 


Figure 9 gives the complexity of TDB in terms of num- 
ber of semicolons in C++ code. 


[Figure 9 table: semicolon counts for the collection store, object store, backup store, chunk store, and common utilities; the counts are not preserved.]
Figure 9: TDB code complexity 


9.5 Performance Comparison 


In this section, we compare the performance of a system 
using either TDB or an off-the-shelf embedded database 
system, which we shall call XDB. The XDB-based 
system layers cryptography on top of XDB. We config- 
ured both systems to use the same cryptographic pa- 
rameters, cache size, and frequency of flushing the tam- 
per-resistant store. 


9.5.1 Workload 


We measured the performance on a benchmark that 
models two operations related to vending digital goods: 

• Bind: A vendor binds three alternative contracts to a 
digital good. 

• Release: A consumer releases the digital good select- 
ing one of the three contracts randomly. 


The benchmark first creates 30 collections for different 
object types. Each collection has one to four indexes. 
The benchmark loads the cache before executing an 
experiment. The experiment consists of 10 consecutive 
bind or release operations. Figure 10 gives the number 
of database operations executed in each experiment. 


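[Figure 10 table: columns read, update, delete, and add; rows release and bind. Only the values 181, 10, and 220 survive, without a recoverable column assignment.]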


Figure 10: Number of database operations. 


9.5.2 Comparison Results 


We repeated each experiment 10 times. Figure 11 
shows the average times for the release and bind ex- 
periments, the part spent in the database system, and the 
part thereof spent in commit, which is the major over- 
head. 


[Figure 11 chart: bars for XDB-release, TDB-release, XDB-bind, and TDB-bind, each divided into db-commit, db-other, and non-db time.]
Figure 11: Runtime comparison 


TDB outperformed XDB, primarily because of faster 
commits, but also in the remaining database overhead. 
We believe that XDB performs multiple disk writes at 
commit. 


The stored size of XDB after running the release ex- 
periment was 3.8 MB. The stored size of TDB was 4.0 
MB, based on 60% maximum log utilization. 


9.5.3 TDB Performance Analysis 


Here, we analyze the performance of the release ex- 
periment. Figure 12 breaks down the TDB overhead by 
module. The time reported for each module excludes 
nested calls to other reported modules. The figure gives 
the average time (μ), the standard deviation (σ), and 
percentage of total (%). 


[Figure 12 table: average time (μ), standard deviation (σ), and percentage of total (%) for each TDB module.]
Figure 12: TDB runtime analysis 


The overhead is dominated by writes to the untrusted 
store. The experiment flushed the untrusted store 96 
times and the tamper-resistant store 19 times. The over- 
head of writing to the tamper-resistant store may vary 
significantly depending on the device and the frequency 
of flushes. There was no checkpoint or log cleaning 
during the experiment. (In the bind experiment, log 
cleaning took a total of 1030 ms.) 


The overhead of encryption and hashing is only 6% of 
the database overhead. The effective bandwidths of 
encryption and hashing are 6.5 MB/s and 20.6 MB/s, 
which are close to the peak bandwidths reported in Sec- 
tion 9.2.1. 


10 Potential Extensions 


The current design of TDB has a number of limitations. 
Below we describe extensions to address them. 


Untrusted storage on servers: TDB may be used to pro- 
tect a database stored at an untrusted server. This appli- 
cation of TDB may benefit from additional optimiza- 
tions for reducing network round-trips to the untrusted 
server, such as batching reads and writes. 


Trusted paging. The current design assumes that the 
entire runtime, volatile state of a trusted program is pro- 
tected by the trusted processing environment. TDB lim- 
its its volatile state by controlling its cache size, but this 
limit is not hard. Therefore, some volatile state may 
have to be paged out to untrusted storage. This problem 
may be solved by using a page fault handler to store 
encrypted and validated pages in the chunk store. 


Steal buffer management. Currently, modified objects 
must remain in the cache until their transaction com- 
mits, which may degrade the security and performance 
of large transactions. Evicting dirty objects would re- 
quire writing them to the log. This requires additional 
support in the chunk store. 


Logical logging. Logical logging may reduce the vol- 
ume of data that must be encrypted, hashed, and written 
to the untrusted store. The chunk store uses logical log- 
ging for some operations (for example, deallocation of 
chunks), but it does not allow higher modules to specify 
operations that should be logged logically. 


11 Related Work 


There are many systems aimed at providing secure stor- 
age. TDB differs from most of them because of its 
unique trust model. 


In another paper at this conference, Fu et al. describe a 
read-only file system that may be stored in untrusted 
servers [FKM00]. A hash tree is embedded in the inode 
hierarchy. The trusted creator signs the root hash with 
the time of update and expiration. This system is not 
designed to handle frequent updates or updates to indi- 
vidual file blocks in the untrusted server. 


Techniques for securing audit logs stored on weakly- 
protected hosts are suitable for securing append-only 
data that is read infrequently and sequentially by a 
trusted computer [BY97, SK98]. They employ a linear 
chain of hash values instead of a tree. When the data 
needs to be read, it is validated by recomputing the hash 
over the entire log. These techniques are not suitable for 
a database system such as ours, which requires frequent 
and random read-write access to data. 


Blum et al. considered the problem of securing various 
data structures in untrusted memory using a hash tree 
rooted in a small amount of trusted memory [BEG+91]. 
This work does not address storage management for 
persistent data. 


Some systems provide secure storage by dispersing data 
onto multiple hosts, with the expectation that at least a 
certain fraction of them (for example, two-thirds) will 
be honest. The data may be replicated as-is for time 
efficiency [CL99], or it might be encoded to reduce the 
cumulative space overhead [Rab89, Kra93, GGJ+97]. 
Read requests are broadcast to all machines and the data 
returned is error corrected. This approach provides re- 
covery from tampering, not merely tamper detection. 
However, it relies on more trusted resources than are 
available to TDB. The expectation of an honest quorum 
is based on the assumption that, under normal opera- 
tion, the hosts are weakly protected but not hostile, so 
the difficulty for a hostile party to take over k hosts in- 
creases significantly with k. 


Our use of log-structured storage builds on previous 
work on log-structured storage systems [RO91, 
JKH93]. The Shadows database system is log structured 
and provides snapshots [Ylo94]. Otherwise, there has 
been little interest in log-structured database systems, 
perhaps because of the need to keep large sets of data 
physically clustered or to keep the log compact using 
logical logging. 


12 Conclusions 


We have presented a trusted database system that lever- 
ages a trusted processing environment and a small 
amount of trusted storage to extend tamper-detection 
and secrecy to a scalable amount of untrusted storage. 
The architecture integrates encryption and hashing with 
a low-level data model, which protects data and meta- 
data uniformly. The model is powerful enough to sup- 
port higher-level database functions such as transac- 
tions, backups, and indexing. 


We found that log-structured storage is well suited for 
building such a system. The implementation is simpli- 
fied by embedding a hash tree in the comprehensive 
location map that is central to log-structured systems: 
objects can be validated as they are located. The check- 
pointing optimization defers and consolidates the 
propagation of hash values up the tree. Because updates 
are not made in place, a snapshot of the database state 
can be created using copy-on-write, which facilitates 
incremental backups. 


We measured the performance of TDB using micro- 
benchmarks as well as a high-level workload. The data- 
base overhead was dominated by writes to the untrusted 
store and the tamper-resistant store, which may vary 
significantly based on the types of devices used. The 
overhead of encryption and hashing was only 6% of the 
total. On this workload, TDB outperformed a system 
that layers cryptography on an off-the-shelf embedded 
database system, while also providing more protection. 
This supports the suitability of the TDB architecture. 


Abstract 


Many boundaries impede the flow of authorization 
information, forcing applications that span those 
boundaries into hop-by-hop approaches to autho- 
rization. We present a unified approach to autho- 
rization. Our approach allows applications that span 
administrative, network, abstraction, and protocol 
boundaries to understand the end-to-end authority 
that justifies any given request. The resulting dis- 
tributed systems are more secure and easier to audit. 

We describe boundaries that can interfere with 
end-to-end authorization, and outline our unified ap- 
proach. We describe the system we built and the 
applications we adapted to use our unified autho- 
rization system, and measure its costs. We conclude 
that our system is a practical approach to the desir- 
able goal of end-to-end authorization. 


1 Introduction 


As systems grow more complex, they are often grown 
by affixing one system to another using some form of 
gateway to bridge boundaries between the systems. 
The boundaries can take several forms; we discuss 
four in this paper. 

When we assemble systems in this way, frequently 
the authorization information available at the client 
system cannot be translated to the terms of autho- 
rization at the server system. As a result, the gate- 
way often ends up making access-control decisions 
on behalf of the server system, and the server sys- 
tem is ignorant of any authorization information be- 
yond a blind trust in the gateway. Our end-to-end 
authorization system remedies this situation. 


2 Goals 


Saltzer et al. describe a general principle for com- 
puter engineering: implement end-to-end semantics 
to achieve correctness, and only implement hop-by- 
hop semantics to boost the performance of the end- 
to-end implementation [19]. Voydock and Kent ar- 
gue for end-to-end security measures when the hops 
are between network routers [24]. The same prin- 
ciple applies to authorization semantics when the 
hops are between gateways that span administra- 
tive boundaries, network scales, levels of abstraction, 
or protocol boundaries. End-to-end authorization 
makes systems more secure by reducing the number 
of programs that make access-control decisions, by 
giving those programs that do control access more 
thorough information, and by providing more useful 
audit trails. In this section, we illustrate four kinds 
of boundaries in distributed systems that impede the 
flow of authorization information from one end of a 
system to another. We discuss how, by giving clients 
and servers the ability to form and verify proofs, our 
unified system can support end-to-end authorization 
through the gateways that span these boundaries. 


*Supported by a research grant from the USENIX 


2.1 Spanning administrative domains 


Administrative boundaries frequently interfere with 
end-to-end authorization. The conventional ap- 
proach to authorization involves authenticating the 
client to a local, administratively-defined user iden- 
tity, then authorizing that user according to an 
access-control list (ACL) for the resource. When 
resources are to be shared across administrative 
boundaries, this scheme fails because the server has 
no local knowledge of the recipient’s identity. 
Typical solutions to this problem involve authenti- 
cating the remote user in the local domain, either by 
having the local administrator create a new account, 
or by the resource owner sharing her password. An- 
other approach is to install a gateway that accesses 
the resource with the local user’s privilege but on 
behalf of the remote user. With the gateway the 
owner achieves her goal of sharing, but obscures the 
identity and authority of the actual client from the 
service that supplies the underlying resource. 
Another way a user might share resources across 
administrative boundaries is by delegating¹ her au- 
thority with restriction. In the example, Alice may 
authorize Bob to perform some restricted set of ac- 
tions on certain resources. Authority information 
flows across the administrative boundary: the del- 
egation provides the resource server with sufficient 
information to reason about the client regardless of 
her membership in the local administrative domain. 
Indeed, the authorization mechanism has no inher- 
ent notion of administrative domain. 


¹ We call delegation what Abadi et al. call handoff. 


2.2 Spanning network scales 


A second boundary that interferes with end-to-end 
authorization is network scale. Network scale affects 
an application’s choice of hop-by-hop authorization 
protocol. For example, a strong encryption protocol 
is appropriate when crossing a wide-area network. 
Inside a firewall where routers are locally adminis- 
tered, some installations may base authority deci- 
sions on IP source addresses. On a local machine, 
we can often trust the OS kernel to correctly identify 
the participants in an interprocess communication. 

Our unified approach separates policy from mech- 
anism, creating two benefits. First, applications rea- 
son about policy using a toolkit with a narrow inter- 
face. The toolkit can transparently support multi- 
ple access mechanisms, and simply enable those that 
policy allows. Second, when an application does not 
support a desired mechanism, we can build a gate- 
way that forwards requests from another mechanism 
while still passing end-to-end authorization informa- 
tion in a form the server understands and verifies. 
Ultimately, the high-level security analysis of a pro- 
gram is independent of mechanism, and reflects end- 
to-end trust relationships. 


2.3 Spanning levels of abstraction 


Another use for gateway programs is to introduce 
another level of abstraction over that provided by 
a lower-level resource server. A file system takes 
disk blocks and makes files; a calendar takes rela- 
tional database records and makes events; a source- 
code repository takes files and makes configuration 
branches. Typically, an abstracting gateway con- 
trols the lower-level resource completely and exclu- 
sively, so that the gateway makes all access-control 
decisions. With end-to-end authorization, one can 
instead allow multiple mutually untrusting gateways 
to share a single lower-level resource. 

For example, a system administrator might con- 
trol the disk-block allocator. To grant Alice access 
to a specific file X, the sysadmin may allow Alice 
to speak for the file system regarding X, and allow 
the conjunction of Alice and the file system quoting 
Alice to speak for the disk blocks. In this configu- 
ration, the file system cannot access the lower-level 
disk block resource without Alice’s agreement (due 
to the conjunction), and Alice cannot meddle with 
arbitrary disk blocks without the file system agree- 
ing that the requests are appropriate. The system 
helps us adhere to the principle of least privilege by 
encoding partial trust in the user and in the file sys- 
tem program. Furthermore, auditing any request for 
disk blocks provides end-to-end information indicat- 
ing the involvement of both Alice and the file system 
program. 


2.4 Spanning protocols 


Commonly a gateway is installed between two sys- 
tems simply to translate requests from one wire pro- 
tocol to another. Like any gateway, these gateways 
often impede the flow of authorization information 
from client to server. 

In our system, authorization information is en- 
coded in a data structure that has both robust and 
efficient wire transfer encodings [18]. Thus the uni- 
fied system is easily adapted for transfer over a va- 
riety of existing protocols. In this paper, we de- 
scribe its implementation over HTTP and over Java 
Remote Method Invocation (RMI). Adapting more 
protocols, such as NFS and SMTP, to support uni- 
fied authorization will result in wider applicability 
of end-to-end authorization. 


The four boundaries described above turn up in 
real systems that accrete from smaller subsystems. 
Gateway software installed at each boundary maps 
requests from clients on one side of the boundary 
to requests for services on the other side. The sys- 
tem described herein allows us, at each boundary, to 
preserve the flow of authorization information along- 
side the flow of requests. By allowing gateways to 
defer authorization decisions to the final resource 
server when appropriate, and ensuring that resource 
servers have a full explanation for the authority of 
the requests they service, we provide applications 
with end-to-end authorization. 


3 Unified authorization 


Above, we motivate the use of a unified system 
to support end-to-end authorization, and allude to 
some of its features. In this section, we give an 
overview of the system we built, part of a project 
called Snowflake that facilitates naming and sharing 
across administrative boundaries. 

The main idea behind our end-to-end authoriza- 
tion is a compact logic of authority. The logic is 
founded in a possible-worlds semantics that provides 
intuition and guidance about possible extensions. 
Due to its length, the detailed semantics appears 
in a companion paper [11]. 

Logical assumptions represent statements that a 
principal believes based on some verification (out- 
side the logic), such as the result of a digital sig- 
nature verification. Principals combine assumptions 
and logical theorems to produce inherently auditable 
proofs of authority. Such proofs are not bearer capa- 
bilities but simply verifiable facts: while they prove 
that a given principal has authority, knowledge of 
the proof by an adversary does not bestow authority
on the adversary. The primary form of statement
is $B \stackrel{T}{\Rightarrow} A$, read “Bob speaks for Alice regarding the
statements in set T.” The statement means that Al- 
ice agrees with Bob about any statement in T that 
Bob might make; the speaks for captures delegation, 
and the regarding captures restriction. 
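A restricted delegation can be pictured as a small record. The following is a minimal sketch, not the representation used in our implementation: the class name `SpeaksFor` and its fields are hypothetical, and the restriction set T is modeled as a plain set of statement strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeaksFor:
    """Hypothetical model of 'subject speaks for issuer regarding T'."""
    subject: str            # the delegate, e.g. "Bob"
    issuer: str             # the principal whose authority is delegated, e.g. "Alice"
    restriction: frozenset  # the set T of statements the delegation covers

    def covers(self, speaker: str, statement: str) -> bool:
        # The issuer agrees only when the subject made the statement
        # (delegation) and the statement lies inside T (restriction).
        return speaker == self.subject and statement in self.restriction


delegation = SpeaksFor("Bob", "Alice", frozenset({"read file X"}))
```

Under this model, Alice stands behind Bob's request to read file X, but not behind a request to write it, and not behind the same request uttered by Charlie.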

The logic stems from the Logic of Authentica- 
tion due to Abadi, Burrows, Lampson, Plotkin, and 
Wobber [1, 13, 25]; as in their logic, ours can en- 
code conjunction (multiple parties exercising joint 
authority) and quoting (one party claiming to speak 
on behalf of another). The logic is backed by a se- 
mantics that not only provides unambiguous mean- 
ing for every logical statement, but tells us how the 
system may and may not be safely extended. 

The formalism suggests a natural implementation 
language that fits nicely with the Simple Public Key 
Infrastructure (SPKI) [9]. Our system generalizes 
SPKI by allowing other forms of principal, so that 
the same framework can be used for authorization 
on a single host using a trusted kernel, authorization
within an administrative domain using a secret-key 
protocol, or authorization in the wide area using a 
public key protocol. We extended the SPKI frame- 
work rather than create our own to simplify poten- 
tial interoperation with SPKI, to exploit SPKI’s un- 
ambiguous S-expression representation, and to build 
on existing implementations of SPKI in C and Java. 

We present the implementation in three sections: 
the infrastructure of the system, the channels of 
communication we have supported, and some ap- 
plications that exploit the authorization model. 
The applications culminate in a configuration that 
bridges each of the four boundaries described above. 

Principals, statements and proofs are the language 
of our system. Section 4 describes each, and dis- 
cusses our implementation. It also describes the 
Prover, a tool used by clients to generate proofs. 
Requests to be authorized are delivered over various 
kinds of channels, from fast local channels connected 
by a trusted kernel, to cryptographically-protected 
network connections. We discuss our implementation
of authorization over channels in Section 5. In
Section 6, we describe the applications and services
we have built that participate in and interoperate 
using the unified authorization system. We measure 
and analyze the costs of our approach in Section 7. 
Section 8 discusses related work, and we summarize 
in Section 9. 


4 Infrastructure 


The basic elements of the system are statements and 
principals. A statement is any assertion, such as “it 
would be good to read file X,” or “Bob speaks for 
Alice,” or “Charlie says Alice speaks for Charlie.” 
A principal is any entity that can make a state- 
ment. Examples include the binary representation 
of a statement itself (that says only what it says), a 
cryptographic key (that says any message signed by 
the key), a secure channel (that says any message 
emanating from the channel), a program (that says 
its output), and a terminal (that says whatever the 
user types on it). 

A proof of authority, like a proof of a mathemati- 
cal theorem, is simply a collection of statements that 
together convince the reader of the veracity of the 
conclusion statement. Of course, in an authoriza- 
tion system, a proof is read by a program, not by a 
mathematician. 


4.1 Statements 


Snowflake’s implementation of sharing begins with 
the Java implementation of SPKI by Morcos [14]. 
It is a useful starting point because not only do we 
wish to preserve features of SPKI, but SPKI includes 
a precise and easily extensible specification of the 
representation of various abstractions. Furthermore, 
starting with a SPKI implementation offers an easier 
path to SPKI interoperability. 

The restriction imposed on a delegation is speci- 
fied using authorization tags from SPKI. Authoriza- 
tion tags concisely represent infinitely refinable sets, 
which makes them an attractive format for user- 
definable restrictions. We replaced Morcos’ minimal 
implementation of authorization tags with a complete
one that performs arbitrary intersection operations
[12, Chapter 6]. Our semantics paper explains
how SPKI’s revocation mechanisms (lists and one- 
time revalidations) can be expressed as statements 
in our logic [11]. 
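To give a feel for what tag intersection involves, here is a deliberately simplified sketch. It assumes a toy model in which a tag is a string, a tuple of tags, or the wildcard `"*"`; the set, prefix, and range forms of real SPKI tags are omitted, and the function name `intersect` is our own for illustration.

```python
STAR = "*"  # stands in for the SPKI wildcard that matches anything


def intersect(a, b):
    """Return the tag permitted by both a and b, or None if they are disjoint."""
    if a == STAR:
        return b
    if b == STAR:
        return a
    if isinstance(a, tuple) and isinstance(b, tuple):
        # A shorter list permits any longer list that extends it, so the
        # intersection keeps the extra trailing fields of the longer list.
        short, long_ = (a, b) if len(a) <= len(b) else (b, a)
        head = []
        for x, y in zip(short, long_):
            z = intersect(x, y)
            if z is None:
                return None
            head.append(z)
        return tuple(head) + long_[len(short):]
    # Atoms (or an atom against a list) intersect only when identical.
    return a if a == b else None


get_only = ("web", ("method", "GET"))
docs_only = ("web", STAR, ("resourcePath", "/docs"))
both = intersect(get_only, docs_only)
```

Here `both` permits only GET requests under `/docs`: each delegation narrows the other.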


4.2 Principals 


SPKI makes a distinction between principals and 
“subjects,” entities that can speak for others but 
can utter no statements directly, such as threshold
(conjunct) principals. Our formalism does not make
that distinction. It also supports new compound 
principals, such as the quoting principal of Lampson 
et al. Therefore, we extended Morcos’ Principal class 
to support SPKI threshold (conjunction) principals 
and Lampson’s quoting principals. When a service 
reads a request from a communications channel, it 
associates the request with an appropriate principal 
object that represents the channel; this principal is 
the one that “says” the request. Because the chan- 
nel itself is a principal, it may claim to quote some 
other principal; that assertion is noted by associ- 
ating the channel with a Quoting principal object. 
The object’s quoter field is the channel itself, and 
its quotee field is the (possibly compound) principal 
the channel claims to quote. 
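The principal forms described above can be sketched as a handful of immutable records. This is an illustrative model only: apart from the `quoter` and `quotee` field names, which come from the text, every class and function name here is an assumption rather than part of the Snowflake code base.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Key:
    fingerprint: str


@dataclass(frozen=True)
class Channel:
    name: str


@dataclass(frozen=True)
class Conjunct:
    members: Tuple["Principal", ...]  # parties exercising joint authority


@dataclass(frozen=True)
class Quoting:
    quoter: "Principal"  # the principal actually speaking (e.g. a channel)
    quotee: "Principal"  # the principal it claims to speak on behalf of


Principal = Union[Key, Channel, Conjunct, Quoting]


def speaker_of(channel: Channel, claimed_quotee: Optional[Principal] = None) -> Principal:
    """Return the principal that 'says' a request read from `channel`."""
    if claimed_quotee is not None:
        return Quoting(channel, claimed_quotee)
    return channel


ch = Channel("K_CH")
alice = Key("alice-key")
quoted = speaker_of(ch, alice)
```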


4.3 Proofs


We implemented a Proof class that represents a 
structured proof consisting of axioms and theorems 
of the logic and basic facts (delegations by princi- 
pals). An instance of Proof describes the state- 
ment that it proves and can verify itself upon re- 
quest. While Proof objects may be received from 
untrusted parties, their methods are loaded from a 
local code base, so that the results of verification are 
trustworthy. Servers receive from clients instances 
of the Proof class that show the client’s authority 
to request service. Conversely, a server may send a 
Proof to a client to establish its authenticity, that 
is, to prove its authority to identify itself by some 
name or to provide some service the client expects. 

Proofs can be transmitted as SPKI-style S- 
expressions or directly transferred between JVMs 
using Java serialization. No precision is lost in the 
latter case, since the basic internal structure of ev- 
ery proof component is a Java object corresponding 
to an S-expression. 

SPKI’s sequence objects also represent proofs of 
authority. SPKI sequences are poorly defined, but 
they are linear programs apparently intended to run 
on a simple verifier implemented as a stack machine. 
When certificates and opcodes are presented to the 
machine in the correct order, the machine arrives at 
the desired conclusion [8]. 

Transmitting proofs in a structured form rather 
than as SPKI sequences is attractive for three rea- 
sons. First, the structured proofs clearly exhibit 
their own meaning; to quote Abadi and Needham, 
“every message should say what it means” [2]. Sec- 
ond, the structured proof components map one-to- 
one to implementation objects that verify each com- 
ponent. The SPKI sequence verifier, in contrast, 
requires an external mapping to show that the state 
machine corresponds to correct application of the
formal logic. Third, it is simple to extract lemmas
(subproofs) from structured proofs, allowing
the prover to digest proofs into reusable components 
(Section 4.4). 

The logic encodes expiration times as part of the 
restriction of a delegation, so that each proof need 
be verified only once. The step of matching a request 
to a proof automatically disregards expired conclu- 
sions, since a current request cannot match a con- 
clusion with a restriction that it was valid only in 
the past. Figure 1 illustrates a proof. Since the 
structure of the proof is preserved, if the topmost 
statement should expire (perhaps because it depends
on the short-lived statement $H_D \Rightarrow K_S$), the
still-useful proof of $K_S \Rightarrow K_C \cdot N$ may be extracted and
reused in future proofs.


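transitivity: $H_D \Rightarrow K_C \cdot N$
    transitivity: $K_S \Rightarrow K_C \cdot N$
        name-monotonicity: $H_C \cdot N \Rightarrow K_C \cdot N$
            hash identity: $H_C \Rightarrow K_C$
        signed-certificate: $K_S \Rightarrow H_C \cdot N$
    signed-certificate: $H_D \Rightarrow K_S$


Figure 1: A structured proof, drawn as a tree in which each
indented line is a premise of the step above it. This proof
shows that document D is the object client C associates with
the name N. $H_C$ is a hash of the client’s key $K_C$, $H_D$ a
hash of the document, and $K_S$ the server’s key.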
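The following sketch illustrates self-verifying structured proofs and lemma extraction. It makes several simplifying assumptions: only the transitivity rule and leaf facts are modeled, signature checks are omitted, and expiration is a separate `expires` field rather than part of the restriction as in the logic. The names `Proof` and `live_lemmas` mirror the text but are not the actual class definitions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Proof:
    rule: str                      # "signed-certificate", "transitivity", ...
    subject: str                   # left side of "subject => issuer"
    issuer: str                    # right side
    expires: float                 # conclusion is valid strictly before this time
    premises: Tuple["Proof", ...] = ()

    def verify(self) -> bool:
        if not all(p.verify() for p in self.premises):
            return False
        if self.rule == "transitivity":
            if len(self.premises) != 2:
                return False
            first, second = self.premises
            # A => B and B => C yield A => C, valid no longer than either premise.
            return (first.subject == self.subject
                    and first.issuer == second.subject
                    and second.issuer == self.issuer
                    and self.expires <= min(first.expires, second.expires))
        # Leaf facts: accepted here without checking a signature (assumption).
        return not self.premises


def live_lemmas(proof: Proof, now: float):
    """Yield the largest subproofs whose conclusions are still valid at `now`."""
    if proof.expires > now:
        yield proof
    else:
        for premise in proof.premises:
            yield from live_lemmas(premise, now)


long_lived = Proof("signed-certificate", "K_S", "K_C.N", expires=1000)
short_lived = Proof("signed-certificate", "H_D", "K_S", expires=10)
top = Proof("transitivity", "H_D", "K_C.N", expires=10,
            premises=(short_lived, long_lived))
```

Once the topmost conclusion has expired (say at time 50), `live_lemmas(top, 50)` yields only the long-lived subproof, which can be cached and reused.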


4.4 The prover 


A Prover object helps Snowflake applications col- 
lect and create proofs. It has three tasks: it collects 
delegations, caches proofs, and constructs new dele- 
gations. 

A user’s application collects delegations from 
other users. Gateways collect delegations directly 
from client applications. Both sorts of applications 
use a Prover to maintain their collected delega- 
tions in a graph where nodes represent principals 
and edges represent a proof of authority from one 
principal to the next (see Figure 2). The Prover 
traverses the graph breadth first to find proofs of 
delegation required by the application. For example,
if the Prover must prove that a channel $K_{CH}$
speaks for a server S, it works backwards from the
node S to find the proof that $A \Rightarrow S$. A is final,
meaning that the Prover can make statements
as A; therefore, Prover simply issues a delegation
$K_{CH} \Rightarrow A$ to complete the proof.





Figure 2: A look inside Alice’s Prover. Each node rep- 
resents a principal, and each edge a proof. For example, 
the edge from A to B represents the proof consisting of 
the single delegation $A \Rightarrow B$. The node A is distinguished
because it is final: it represents a principal that
the Prover can cause to say things. 


When the Prover receives a delegation that is actually
a proof involving several steps, the Prover
“digests” the proof into its component parts for storage
in the graph. Whenever it receives or computes a
derived proof composed of smaller components, the 
Prover adds a shortcut edge (dotted line in Figure 2) 
to the graph to represent the proof. These shortcuts 
form a cache that eliminates most deep traversals of 
the graph. 
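The graph search and shortcut cache can be sketched as follows. This is a minimal illustration under our own assumptions: restrictions and expiration are ignored, a proof is just a list of (subject, issuer) steps, and the method names `add`, `find`, and `prove` are hypothetical.

```python
from collections import deque


class Prover:
    def __init__(self, final):
        self.final = set(final)  # principals this Prover can speak as
        self.edges = {}          # issuer -> list of (subject, proof steps)

    def add(self, subject, issuer, steps=None):
        """Record a proof that `subject` speaks for `issuer`."""
        steps = list(steps) if steps else [(subject, issuer)]
        self.edges.setdefault(issuer, []).append((subject, steps))

    def find(self, issuer):
        """Search backwards from `issuer`, breadth first, for a final principal."""
        queue, seen = deque([(issuer, [], 0)]), {issuer}
        while queue:
            node, path, hops = queue.popleft()
            if node in self.final:
                if hops > 1:
                    self.add(node, issuer, path)  # shortcut edge acts as a cache
                return path
            for subject, steps in self.edges.get(node, []):
                if subject not in seen:
                    seen.add(subject)
                    queue.append((subject, steps + path, hops + 1))
        return None

    def prove(self, subject, issuer):
        """Complete a proof that `subject` speaks for `issuer`, if possible."""
        path = self.find(issuer)
        if path is None:
            return None
        final = path[0][0] if path else issuer
        # The Prover controls `final`, so it may issue the last delegation itself.
        return [(subject, final)] + path


prover = Prover(final={"A"})
prover.add("A", "B")
prover.add("B", "S")
proof = prover.prove("K_CH", "S")
```

After the first search, the graph holds a direct shortcut from A to S, so a repeated query reaches A in a single hop.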

When an application controls one or more princi- 
pals (e.g., by holding the corresponding private key 
or capability), its Prover can store a closure (an 
object that knows the private key or how to exer- 
cise the capability) in its graph to represent the con- 
trolled principal. When desired, the Prover can not 
only find existing proofs, but complete new proofs 
by finding an existing chain of delegations from the 
controlled principal to the required issuer, then us- 
ing the closure to delegate to the required subject 
restricted authority over the controlled principal. 

Our simple Prover is incomplete, but it is suit- 
able for most authorization tasks applications face. 
Abadi et al. note that solutions to the general access- 
control problem in the presence of both conjunc- 
tion and quoting require exponential time [1, p.726]. 
Elien gives a polynomial-time algorithm for discov- 
ering proofs in a graph with only SPKI certificates 
(no quoting principals) [7]. In the common case, we 
expect applications to collect authorization information
in the course of resolving names, so that proofs
are built incrementally with graph traversals of constant
depth.


5 Channels 


With the infrastructure above in place, applications 
and services have the tools they need to generate, 
propagate, and analyze authority from the source of 
a request to its final resource server. The autho- 
rization information must be propagated from one 
program to the next through channels. 

When a client makes a request of a server, the 
server needs some mechanism to ensure that the 
client really uttered the request. We implemented 
three such mechanisms: a secure network channel, 
a local channel vouched for by a trusted authority 
in the same (virtual) machine, and a signed request. 
We describe each and discuss how they are repre- 
sented as principals in our unified system. 


5.1 Secure channels 


To implement a secure channel, we built a Java im- 
plementation of the ssh protocol that can interoper- 
ate with the Unix sshd service [26]. Then we built 
Java ServerSocket and Socket classes based on ssh 
that provide a secure connection. Either end of the 
connection can query its socket to discover the public
key associated with the opposite end.²

We plugged our ssh sockets into RMI using socket 
factories. Ssh ensures that the channel is secure be- 
tween some pair of public keys. To make that guar- 
antee useful, we embody the channel as a principal. 
Consider the channel in Figure 3. To establish the
channel, the server (principal $P_S$) uses public key
$K_1$ and the client ($P_C$) key $K_2$ in the key exchange,
and together they establish secret key $K_{CH}$ as the
symmetric session key.


[Figure: the client ($P_C$, key $K_2$) and the server ($P_S$, key $K_1$)
joined by a channel with secret key $K_{CH}$.]


Figure 3: Treating a channel as a principal


Suppose a message M emerges from the channel
at the server. In the language of the formalism,
the ssh implementation promises that $M \Rightarrow K_{CH}$.
The initial key exchange convinced the server that
$K_{CH} \Rightarrow K_2$, and the client may explicitly establish
that $K_2 \Rightarrow P_C$. Because $M \Rightarrow K_{CH} \Rightarrow K_2 \Rightarrow P_C$,
the server concludes that $M \Rightarrow P_C$, that is, the message
says what the client is thinking.


² Why did we build an ssh implementation? Some have
suggested that we use SSL over RMI, which is apparently now
fairly practical. When we began this work, however, RMI did
not have easily pluggable socket factories, and even once it
did, the only open-source SSL implementation we could find
did not operate well under RMI.


5.1.1 How channels work 


Figure 4 illustrates our RMI/ssh channel in action.
Initially, the server creates an instance of an RMI
remote object, defines the key $K_S$ that controls it,
and associates the object with an SSHContext that
manages any incoming messages for the object.
The SSHContext is associated with the RMI listener
socket that will receive incoming requests for the
object, and defines the public key ($K_1$) that will
participate in ssh session establishment.


The client retrieves a stub for the remote object
from a name service it trusts. To exercise its authority
on the object, the client first establishes its
identity in thread scope. In a try ... finally
block, it establishes its own SSHContext and
a Prover that holds its private key $K_C$. Any
method called in the run-time scope of the try block
will inherit the established authority, but the authority
will be canceled when control exits the block.
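The scoping idiom can be illustrated with a context manager, which plays the role of the Java try ... finally block. This is a sketch under stated assumptions: the names `acting_as` and `current_prover` are hypothetical, and a string stands in for a real Prover object.

```python
import contextvars
from contextlib import contextmanager

# Per-thread (per-context) slot holding the authority currently in force.
_authority = contextvars.ContextVar("authority", default=None)


@contextmanager
def acting_as(prover):
    """Establish `prover` as the caller's authority for the enclosed block."""
    token = _authority.set(prover)
    try:
        yield prover
    finally:
        # Authority is cancelled when control leaves the block, even on error.
        _authority.reset(token)


def current_prover():
    """Return the authority inherited by code running in the current scope."""
    return _authority.get()


def demo():
    seen = [current_prover()]
    with acting_as("client-prover"):
        seen.append(current_prover())  # any call made here inherits the authority
    seen.append(current_prover())
    return seen
```

Calling `demo()` shows no authority before the block, the established authority inside it, and none again afterwards.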


Then the client invokes a method m on the remote
stub. The remote stub has been mechanically
rewritten to wrap its remote invocations with calls to
the invoker helper method. The invoker method
makes the usual RMI remote call through the remote
reference, and the reference creates an ssh
socket using the SSHSocketFactory specified in
the stub. The ssh channel is established, and
each context learns the public key associated with
the opposite end ($K_1$, $K_2$). The method call passes
through the channel to the skeleton object on the
server, which forwards the call to the implementation
object.

The programmer has prepended to each remote
method implementation a call to the no-argument
method checkAuth(). This routine discovers
from the local SSHContext the key $K_2$ associated
with the channel that the request arrived on, and
concludes $K_2$ says m. The server object was associated
at creation with the key $K_S$, however, and
checkAuth() does not know that $K_2$ speaks for $K_S$,
so it throws an SfNeedAuthorizationException.


RMI passes the exception back through the channel,
where the client’s invoker method catches it.
The invoker inspects the exception to discover the
issuer $K_S$ it must speak for and the minimum restriction
set regarding which it must speak for that
issuer.³ The invoker queries the Prover for a
proof of the required authority; since the prover controls
the client’s private key $K_C$, it can construct
a statement to delegate authority from $K_C$ to $K_2$.
The exception carries a reference to a special remote
proofRecipient object; the invoker calls a
method on it to pass the proof to the server. The
proofRecipient object stores the proof at the
server, and returns to the client.

The invoker again sends the original invocation m 
through the remote reference, and the request travels
the same path to checkAuth() on the server. This
time, the proof that $K_2 \Rightarrow K_S$ (via $K_C$) is available,
checkAuth() returns without exception, and
the remote object’s implementation method runs to 
completion. Future calls encounter no exception as 
long as the proof at the server remains valid, and are 
only slowed by the layer of encryption protecting the 
integrity of the ssh channel. 
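The challenge-and-retry exchange can be summarized in a few lines. The sketch below is not the RMI implementation: it runs in one process, a (channel key, method) pair stands in for a verified proof, and the names `NeedAuthorization`, `Server`, and `invoke` are illustrative assumptions.

```python
class NeedAuthorization(Exception):
    """Challenge naming the issuer and minimum restriction the caller must prove."""

    def __init__(self, issuer, min_tag, recipient):
        super().__init__(f"need proof of authority over {issuer}")
        self.issuer, self.min_tag, self.recipient = issuer, min_tag, recipient


class Server:
    def __init__(self, issuer):
        self.issuer = issuer
        self.proofs = set()  # stand-ins for proofs already received

    def submit_proof(self, proof):
        # Plays the role of the proofRecipient object.
        self.proofs.add(proof)

    def check_auth(self, channel_key, method):
        if (channel_key, method) not in self.proofs:
            raise NeedAuthorization(self.issuer, method, self)

    def read_mail(self, channel_key):
        self.check_auth(channel_key, "read_mail")  # first line of each remote method
        return "inbox contents"


def invoke(call, channel_key, make_proof):
    """Client-side helper: call, and on a challenge supply a proof and retry once."""
    try:
        return call(channel_key)
    except NeedAuthorization as challenge:
        proof = make_proof(challenge.issuer, challenge.min_tag)
        challenge.recipient.submit_proof(proof)
        return call(channel_key)


server = Server("K_S")
proofs_built = []


def make_proof(issuer, tag):
    proofs_built.append(issuer)
    return ("K_2", tag)  # a real Prover would return a verifiable proof


first = invoke(server.read_mail, "K_2", make_proof)
second = invoke(server.read_mail, "K_2", make_proof)
```

Only the first call pays for proof construction; the second finds the stored proof at the server and proceeds directly.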

The client programmer need only establish the 
client’s authority at the top of a code block; in- 
side that scope, the Prover and the invoker together 
handle the nitty-gritty of proof generation and authorization.
In the idiom we adopt, the server programmer
defines the object server key $K_S$ and the
mapping from method invocation to restriction set
(T) for a server object, then prefixes each Remote
method with calls to a generic checkAuth() that
uses those definitions. We chose this approach because
it would be simple to automate the injection
of checkAuth() calls to ensure that no Remote interface
is left unprotected.


5.2 Local channels 


Setting up a secure network channel is an expensive 
operation because it involves public-key operations 
to exchange keys. If a server trusts its host machine 
enough to run its software, it may as well trust the 
host to identify parties connected to local IPC chan- 
nels. Within our Java environment, we treat the 
JVM and a few system classes as the trusted host, 
and bypass encryption when connecting to a server 
in the same JVM. 


³ In this example, the minimum restriction set $T = \{m\}$
contains the singleton request (method invocation) made by
the invoker. When some more-sophisticated mapping is involved,
where the server’s minimum restriction set may reveal
sensitive structure of the service, the server may reveal
the set only incrementally. For example, its first challenge
may tell the client how to prove authority to learn the “real”
restriction set. The situation is analogous to ls -l foo/bar
in Unix: it reveals the authority required by a client to access
a resource bar, but only after the client has shown its authority
to learn that information by logging in with a UID that
has permission to read the directory foo.


Figure 4: How our ssh RMI channel is integrated with Snowflake’s authorization service.
Solid arrows represent the critical remote call path; the figure’s other two arrow
styles represent object references and the longer path taken when the server
requires fresh proof of the client’s authority.


In the local case, the ssh channel is replaced with 
a Java “IPC” pipe implemented without any operating
system IPC services, and the public keys corresponding
to the channel endpoints ($K_1$ and $K_2$) are
swapped directly. Because it was involved in con- 
structing the key pairs and the keys are stored in 
immutable objects, the trusted system class knows 
whether a client holds the private key corresponding 
to a given public key. Hence when a client is colo- 
cated in the same JVM with the server, there is no 
encryption or system-call overhead associated with 
the channel, only RMI serialization costs. 


5.3 Signed requests 


Not all applications can assume that our ssh- 
enhanced version of RMI is available as an RPC 
mechanism. Indeed, the most visible RPC mech- 
anism on the Internet is HTTP. To facilitate ap- 
plications that use HTTP, we created a Snowflake 
version of the HTTP authorization protocol. 

HTTP defines a simple, extensible challenge- 
response authorization mechanism [10]. The client 
sends an HTTP request to the server. The server 
replies with a “401 Unauthorized” response, in- 
cluding a WWW-Authenticate header describing the 
method and other parameters of the required au- 
thorization. The client resends its request, this 
time including an Authorization header. If the 
Authorization satisfies the server’s challenge, the 
server honors the request and replies with the re- 
turn value of the operation. Otherwise, the server 
returns a “403 Forbidden” response to indicate the 
authorization failure. 

HTTP defines two standard authorization methods.
In Basic Authentication, the client’s
Authorization header includes a password in
cleartext. In Digest Authentication, the server’s
WWW-Authenticate challenge includes a nonce, and
the client’s Authorization header consists of a secure
hash of the nonce and the user’s password.
Both methods authenticate the client as the holder
of a secret password, and leave authorization to an
ACL at the server.

In our new method, called Snowflake Autho- 
rization, the parameters embedded in the server’s 
WWW-Authenticate challenge are the issuer that the 
client needs to speak for and the minimum re- 
striction set that the delegation must allow. The 
Authorization header in the client’s second request 
simply includes a Snowflake proof that the request 
speaks for the required issuer regarding the speci- 
fied restriction set. The subject of the proof is a 
hash of the request, less the Authorization header. 
Figure 5 shows an example. 
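Computing the subject of such a proof amounts to hashing the request with its Authorization header left out. The sketch below shows one way this could be done; the canonical form (sorted, lower-cased header names) and the choice of SHA-256 are our assumptions for illustration and are not specified by the protocol description above.

```python
import hashlib


def request_subject(method, path, headers, body=b""):
    """Hash a request, omitting its Authorization header (hypothetical canonical form)."""
    kept = sorted((name.lower(), value.strip())
                  for name, value in headers.items()
                  if name.lower() != "authorization")
    digest = hashlib.sha256()
    digest.update(f"{method} {path}\n".encode())
    for name, value in kept:
        digest.update(f"{name}: {value}\n".encode())
    digest.update(b"\n" + body)
    return digest.hexdigest()


unsigned = request_subject("GET", "/mail", {"Host": "example.org"})
signed = request_subject("GET", "/mail",
                         {"Host": "example.org",
                          "Authorization": "SnowflakeProof placeholder"})
```

Because the Authorization header is excluded, the hash is the same before and after the proof is attached, so the proof can name the very request that carries it.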


5.3.1 Signed request optimization 


The signed request protocol described above is 
rather slow, since it incurs a public-key signature 
for every request. We implemented a more efficient 
protocol that amortizes the public-key operation by 
having the server send an encrypted, secret message 
authentication code (MAC) to the client. The client 
then authorizes messages by sending a hash of (message,
MAC). The protocol is represented in the
end-to-end authorization chain by representing the 
MAC as a principal. 
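The amortized protocol can be sketched with standard-library primitives. This is an illustration of the idea only: the function names are hypothetical, SHA-256 is an assumed hash, and the encrypted delivery of the secret to the client is not shown.

```python
import hashlib
import hmac
import secrets


def new_mac_secret() -> bytes:
    # The server would send this value to the client encrypted under the
    # client's public key; that single public-key operation is then amortized.
    return secrets.token_bytes(32)


def authorize(message: bytes, mac_secret: bytes) -> str:
    """Client side: hash of (message, MAC secret), as described in the text."""
    return hashlib.sha256(message + mac_secret).hexdigest()


def check(message: bytes, tag: str, mac_secret: bytes) -> bool:
    """Server side: recompute the hash and compare in constant time."""
    return hmac.compare_digest(tag, authorize(message, mac_secret))


secret = new_mac_secret()
tag = authorize(b"GET /mail", secret)
```

A production design would more likely use the keyed construction provided by the `hmac` module rather than a bare hash of the concatenation; the plain form is kept here because it matches the description above.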

SSL channels offer an alternative approach to
amortizing the initial public-key operation, with different
security and performance trade-offs.


HTTP/1.0 401 UNAUTHORIZED
Content-Type: text/html
MIME-Version: 1.0
Server: MortBay-Jetty-2.3.3
Date: Sat, 08 Apr 2000 15:18:47 GMT
WWW-Authenticate: SnowflakeProof
Authorize-Client
Sf-ServiceIssuer: (hash md5
| ehtQYd4EpQX0a/ON6Smesg==| )
Sf-MinimumTag: (tag
(web (method GET)
(service |Sm9uJ3MgUHJvdGV jdGVpY2U=| )
(resourcePath "")))
Connection: close


Figure 5: An HTTP authorization challenge message
from a Snowflake server. It indicates the method, the
required resource issuer, and the minimum restriction of
a delegation that must be proven.


5.3.2 Authorization vs. authentication 


The SPKI group argues that authorizing a request 
without authentication as an intermediate step re- 
duces indirection and hence removes opportunities 
for attack [9]. When authentication is desired, one 
can use the logic to demand it. For example, one 
may delegate a resource to “authentication server’s 
Alice,” requiring Alice to authenticate herself to the 
server to invoke her authority over the resource. Al- 
ternatively, one can resolve the secure bindings that 
map keys to names after the fact to discover whose 
authority was invoked. How meaningful an authen- 
tication is depends on one’s philosophy about dele- 
gation control [11]. 


5.3.3 Server authorization 


Often a client also wants to verify that it is com- 
municating with the “right” server. The notion 
of “right” can be as simple as the server speak- 
ing for the client’s idea of a well-known name like 
www.dartmouth.edu, but in general the real ques- 
tion is still one of authorization: Does this server 
have the right to claim authority about Dartmouth’s 
course list? Does that server have authority to re- 
ceive my e-mail? 

We addressed a limited version of this problem 
with a second HTTP extension that enables a server 
to show the authenticity of a document using the authorization
system. The server includes with document
headers a proof that the hash of the document
speaks for the server. The client completes the proof 
chain and determines whether the authentication is 
satisfactory.


5.3.4 Server implementation


We implement the server side of the signed-requests 
protocol as an abstract Java servlet, ProtectedServlet
[15]. Concrete implementations extend
ProtectedServlet with a method that maps a re- 
quest to an issuer that controls the requested re- 
source and to the minimum restriction set required 
to authorize the request. The concrete class also 
supplies the service implementation that maps a re- 
quest to a response. When each request arrives, the 
ProtectedServlet ensures that appropriate autho- 
rization has been supplied, and if not, constructs 
and returns the “401 Unauthorized” response to the 
client. 
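The division of labor between the abstract servlet and its concrete subclasses can be sketched as below. Only the class name ProtectedServlet and the header names shown in Figure 5 come from our system; the method names, the dictionary-based request, and the placeholder proof check are assumptions made for this illustration.

```python
from abc import ABC, abstractmethod


class ProtectedServlet(ABC):
    @abstractmethod
    def required_authority(self, request):
        """Return (issuer, minimum_tag) guarding the requested resource."""

    @abstractmethod
    def serve(self, request):
        """Produce the response body for an authorized request."""

    def proof_is_valid(self, request, issuer, tag):
        # Placeholder: a real server would verify a Snowflake proof here.
        return request.get("proof") == (issuer, tag)

    def handle(self, request):
        issuer, tag = self.required_authority(request)
        if not self.proof_is_valid(request, issuer, tag):
            challenge = {"WWW-Authenticate": "SnowflakeProof",
                         "Sf-ServiceIssuer": issuer,
                         "Sf-MinimumTag": tag}
            return 401, challenge, ""
        return 200, {}, self.serve(request)


class FileServlet(ProtectedServlet):
    def required_authority(self, request):
        tag = f"(web (method GET) (resourcePath {request['path']}))"
        return "owner-key-hash", tag

    def serve(self, request):
        return f"contents of {request['path']}"


servlet = FileServlet()
status, headers, body = servlet.handle({"path": "/a"})
retry = servlet.handle({"path": "/a",
                        "proof": ("owner-key-hash", headers["Sf-MinimumTag"])})
```

The concrete class names a single issuer and a minimum tag per request; it never consults a list of authorized users.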

Notice that the server identifies only a single prin- 
cipal that controls the resource, not an ACL. An 
ACL is a specific group of users authorized to access 
a resource; in our system, the client is responsible 
to know and exploit its group memberships as rep- 
resented in delegations [11]. 


5.3.5 Client implementation 


We realize our client as an HTTP proxy that en- 
hances a browser with Snowflake authorization and 
server document-authentication services. Like any 
proxy, it forwards each HTTP request from the 
browser to a server. When a reply is “401 Unau- 
thorized” and requires Snowflake authorization, the 
proxy uses its Prover to find a suitable proof, 
rewrites the request with an Authorization header, 
and retries the request. 

The proxy provides an HTML user interface
to its services at a virtual URL
http://security.localhost/. Through this 
interface, the user can create a new private key pair, 
import principal identities and delegations, and 
delegate his authority to others. To delegate his 
authority, the user views a history of recently-visited 
pages, clicks the “delegate” link next to the page 
he wishes to share, and selects the recipient from a 
list of principals. The proxy generates an HTML 
snippet for the user to deliver to the recipient. A 
link inside the snippet names the destination page 
and carries both the delegation from the user as 
well as the proof the user needed to access the page. 
When the recipient follows the link, his own proxy 
imports the authorization information and redirects 
his browser to the named page. 


6 Applications 


We built three applications to demonstrate the 
Snowflake architecture for sharing.


6.1 Protected web server


The first application is simply a protected web file 
server that uses Snowflake’s sharing architecture. 
One user establishes control over the file server by 
specifying the hash of his public key when starting 
up the server; he may delegate to others permission 
to read subtrees or individual files from the server 
using the mechanisms described above. 


6.2 Protected database 


The second application attaches Snowflake security 
to a relational email database. The original database 
server accepts insert, update, and select requests 
as RMI invocations on a Remote Database object, 
and returns the results of the query as serialized 
objects from the database. Adapting the applica- 
tion to Snowflake required only minimal changes. 
We modified the database instance constructor to 
use a SshSocketFactory so that all connections to 
the object use our ssh secure channels. Then, we 
prepended each implementation of a method in the 
remote interface with a call to the checkAuth() 
method. The database clients required only a mod- 
ification to their initialization code to install an 
SSHContext and a Prover. 


6.3 Quoting protocol gateway 


The third application is a protocol gateway that pro- 
vides an HTML over HTTP front-end to the email 
database. A database can be configured to allow 
certain principals access to certain data records. In 
the course of serving multiple users, the gateway can 
simultaneously access both Alice’s and Bob’s email
records. It is important that the gateway not misuse
its authority and accidentally allow Bob to read
Alice’s email. The gateway programmer could try 
to prevent this mistake by checking access-control 
restrictions itself, but this approach duplicates the 
access control checks in the database, and increases 
the opportunity for error. 

A better approach is to use quoting. The 
gateway’s authority to access Alice’s email in the 
database depends on the gateway intentionally quot- 
ing Alice in its requests. Therefore, as long as the 
gateway correctly quotes its clients in its requests on 
the database server, the correct access-control deci- 
sion is made by the server. 
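A toy model makes the benefit of quoting concrete. In this sketch, which is ours and not the gateway's actual code, a pair (gateway, client) stands for the compound principal "gateway quoting client", and the database accepts a request only if it holds a delegation to exactly that compound principal.

```python
def quoting(gateway, client):
    """Stand-in for the compound principal G|client."""
    return (gateway, client)


class MailDatabase:
    def __init__(self):
        self.rows = {"alice": ["hi Alice"], "bob": ["hi Bob"]}
        self.delegations = set()  # (speaker, mailbox) pairs the server accepts

    def delegate(self, speaker, mailbox):
        self.delegations.add((speaker, mailbox))

    def select(self, speaker, mailbox):
        # The access-control decision is made here, at the resource server.
        if (speaker, mailbox) not in self.delegations:
            raise PermissionError(f"{speaker} has no authority over {mailbox}")
        return self.rows[mailbox]


class Gateway:
    def __init__(self, db):
        self.db = db

    def handle(self, client, mailbox):
        # The gateway makes no access-control decision; it only quotes its client.
        return self.db.select(quoting("G", client), mailbox)


db = MailDatabase()
db.delegate(quoting("G", "alice"), "alice")
db.delegate(quoting("G", "bob"), "bob")
gateway = Gateway(db)
```

Even though the gateway holds delegations from both users, a request it makes while quoting Bob cannot reach Alice's mailbox.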

A transaction begins when the client (C) sends
an unauthorized request (R) to the gateway (G). 
The gateway queries the client for the identity the 
client wishes to use, and a delegation that the gate- 
way speaks for the client to perform the task. The 
gateway attempts to access the database server (S),
but the RMI authorization fails because the gateway
has no authority. The gateway sees an exception
that indicates the required issuer S and restriction
set (T). The gateway generates a “401 Unauthorized”
Snowflake Authorization HTTP response,
and in that response indicates it needs a proof that
$G|? \stackrel{T}{\Rightarrow} S$. By $G|?$ the gateway means it needs
a proof of authority that the gateway quoting the
client speaks for the database. The client knows to
substitute its identity for the “pseudo-principal” ?;
this shortcut saves a round-trip from the gateway to
the client to discover the client’s identity.

The client proxy now knows it needs to delegate its 
authority over the server to the principal “gateway 
quoting client,” G|C. The client proxy generates 
the proof and submits it to the gateway along with a 
signed copy of its original request (showing $R \Rightarrow C$).
The gateway digests the new proof and forwards the 
request to the database server. This time, the au- 
tomatic RMI authorization protocol of Section 5.1.1 
finds the proof in the gateway’s Prover, and the 
database fulfills the request. The gateway builds an 
HTML interface from the database results for pre- 
sentation to the user. Subsequent requests are ac- 
cepted without so much fanfare, since the database 
server holds the appropriate proof of delegation. 

The quoting gateway is a motivating application 
because it spans each of the four boundaries dis- 
cussed in Section 2. Our gateway operates identi- 
cally whether the client and the server are in the 
same administrative domain or different ones. It can 
be colocated with the server, in which case its RMI 
transactions automatically avoid encryption over- 
head by using the local channels of Section 5.2. The 
gateway constructs a view of an e-mail message from 
several rows and tables of a relational database, and 
so introduces a level of abstraction above the server 
resource. Finally, the gateway spans protocols by 
connecting an HTTP-speaking web browser with an 
RMI-speaking database server. Despite each of these 
boundaries, the gateway preserves the entire chain of 
authority that connects the client to the final server, 
enabling the server to make a fully-informed access- 
control decision. 


6.3.1 Correctness and trust 


The client trusts the gateway not to abuse the 
client’s authority, and for some applications, the 
client may even trust the gateway to tell it how 
much authority the gateway needs to do its job. To 
establish that trust, a client might first challenge
the gateway to authenticate itself. If a gateway
has received delegated authority from multiple
clients (Alice and Bob), it must ensure that when
it fulfills Bob’s request it does not accidentally invoke
Alice’s authority. Where a conventional gateway
would actually make access-control decisions to
determine what Bob is allowed to do, our gateway
need only be careful to quote each client correctly.
It is therefore easier to verify that a quoting gateway 
is correct with respect to authorization. 
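
The point about quoting can be illustrated with a toy model. The sketch below is not Snowflake code: the class names, the string form of principals, and the set of accepted proofs are all hypothetical simplifications. It shows only that a gateway which labels every forwarded request with "G|client" leaves the access-control decision to the server, so Bob's request cannot draw on Alice's delegation.

```python
class Server:
    """Toy server that grants access only to principals it holds a proof for."""

    def __init__(self):
        # Principals shown to speak for the server (hypothetical representation).
        self.proofs = set()

    def accept_proof(self, principal):
        self.proofs.add(principal)

    def handle(self, principal, request):
        if principal not in self.proofs:
            raise PermissionError(f"no proof that {principal} speaks for S")
        return f"ok: {request}"


class QuotingGateway:
    """Toy gateway G that labels each forwarded request with 'G|client'."""

    def __init__(self, server):
        self.server = server

    def forward(self, client, request):
        # The gateway makes no access-control decision; it only quotes.
        return self.server.handle(f"G|{client}", request)


server = Server()
gateway = QuotingGateway(server)
server.accept_proof("G|Alice")  # Alice delegated to "gateway quoting Alice"
```

In this model `gateway.forward("Alice", ...)` succeeds, while `gateway.forward("Bob", ...)` raises `PermissionError` until Bob supplies his own delegation.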

In our system the notion of TCB is parameterized 
by the resource being protected. For example, the 
client software and hardware are part of the TCB 
for any resources the client is authorized to manip- 
ulate; when the client delegates a subset of those 
resources to the gateway, the gateway software and 
hardware become part of the TCB for that subset 
of resources. Although quoting helps us write the 
gateway application with greater confidence in its 
correctness, we cannot escape the fact that a com- 
promised gateway still compromises the resources 
delegated to the gateway. Because the gateway is 
involved in the transfer of authority, authorization is 
not end-to-end in the pure sense of abstracting away 
intermediate steps. It is end-to-end, however, in the 
sense that authorization information now passes all 
the way from client to server, and the proof of au- 
thority verified by the server even includes evidence 
of the gateway principal’s involvement. 


7 Measurement 


To better understand the costs of the Snowflake au- 
thorization model, and how they compare to costs 
of related systems, we timed the performance of our 
Snowflake-enhanced RMI implementation and our 
Snowflake-enhanced HTTP implementation. For 
comparison, we also timed standard RMI and stan- 
dard HTTP servers with and without SSL support. 


7.1 Experimental method 


The values reported in this section are the param- 
eters of linear regressions. In setup cost and band- 
width experiments, we vary the file length to sepa- 
rate copy cost from connection setup. In setup and 
per-request experiments, we vary the number of con- 
nections made after some slow setup operation to 
determine the amortizable part of the cost. 
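
As a rough illustration of the regression described above, the following sketch fits total time against file length by ordinary least squares; the intercept estimates the setup cost and the slope estimates the per-byte copy cost. The function name and the sample data are hypothetical and synthetic, not part of the measurement harness or its results.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys = intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic example (not measured data): 5 ms setup plus 0.001 ms per byte.
lengths = [0, 1000, 10000, 100000]
times_ms = [5.0 + 0.001 * n for n in lengths]
setup_ms, ms_per_byte = fit_line(lengths, times_ms)
```

The same fit applies to the per-request experiments by using the number of connections as the independent variable instead of the file length.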

We made the measurements on 270 MHz Sun
Ultra 5 hosts with 128 MB RAM, connected by a
shared 10 Mbps Ethernet segment. The hosts run 
Solaris 2.7, Apache 1.3.12, OpenSSL 0.9.5, a locally- 
compiled Java JDK 1.2.2 with green threads, 
PureTLS 0.9b1, and Cryptix 3.1.1. We used 1024- 
bit RSA keys. 

We ran each experiment ten times, discarding the 
first iteration so that caches are warm except where 
we intentionally measure setup costs. On each run,
we repeated an operation 10 to 1000 times, enough
to amortize measurement overhead, and noted the 
total wall-clock time. When the nine runs had a coefficient
of variation greater than 0.1, we re-ran the
experiment. We report values to two significant fig- 
ures. The figures show values for single-machine ex- 
periments, where computation time, the dominant 
source of overhead, cannot hide under network la- 
tency. The raw data, complete tables of computed 
parameters, standard deviations and R² fitness coefficients
are available [12, Chapter 12]. We computed
95% confidence intervals on the linear-regression pa- 
rameters and found them vanishingly small. 
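
The re-run criterion can be sketched as follows. This is a minimal illustration, assuming the coefficient of variation is the sample standard deviation divided by the mean; the text does not say whether the sample or population form was used. The function names and the timing values are hypothetical.

```python
import statistics


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)


def needs_rerun(run_times, threshold=0.1):
    """Drop the first (cold-cache) run, then test the remaining runs."""
    warm = run_times[1:]
    return coefficient_of_variation(warm) > threshold


# Synthetic timings in milliseconds: ten runs each, the first one cold.
steady = [9.0, 5.0, 5.1, 4.9, 5.0, 5.2, 4.8, 5.0, 5.1, 4.9]
noisy = [9.0, 5.0, 8.0, 3.0, 5.0, 9.0, 2.0, 5.0, 7.0, 4.0]
```

Here `needs_rerun(steady)` is false and `needs_rerun(noisy)` is true. For experiments that intentionally measure setup costs, the first run would be kept rather than dropped.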


7.2 RMI authorization with Snowflake


In this section, we quantify our implementation of 
Snowflake authorization over Java remote method 
invocation as described in Section 5.1. Figure 6 sum- 
marizes the overhead our prototype adds to RMI. 
The test operation is a Remote object that returns 
the contents of a file. Most of the overhead present 
in Snowflake is due to layering RMI over the ssh protocol.
The extra work is the server’s checkAuth()
call, which retrieves the caller’s public key, finds a 
cached proof for that subject, and sees that the proof 
has already been verified. The data-copy cost is
unchanged compared to the ssh case.

Figure 6: The cost of introducing Snowflake authorization
to RMI. A basic RMI call costs 4.8 ms. Securing the
channel with ssh introduces significant overhead. Map- 
ping the request into Snowflake and verifying the client’s 
authority adds another 5 ms. 


It costs 470 ms to establish a new Snowflake-authorized
RMI connection, reflecting the public-key
operation the client performs to delegate its authority
to the channel. When the client caches the delegation
but we make the server forget its copy after
each use, we learn that the server spends 190 ms 
parsing and verifying the proof from the client. 


7.3 HTTP authorization with Snowflake


In this section, we quantify our implementation of
Snowflake authorization over the HTTP protocol as
described in Section 5.3. As shown in Figure 7, the
overhead of Java client and server code introduces a
five-fold slowdown over an optimized C implementation
of HTTP. Most of the rest of Snowflake’s slowdown
we have accounted for in the slow libraries
described in Section 7.4.3.

Figure 7: The cost of introducing Snowflake authorization
to HTTP. A trivial C client accessing an Apache
server takes 4.6 ms. Replacing the client and server with
convenient but inefficient Java packages brings the baseline
for HTTP to 25 ms. Most of Snowflake’s overhead
reflects the use of inefficient SPKI libraries, shown as an
inset bar.


The black bars in Figure 8 show our measurements 
of a Java SSL implementation, and the gray and 
white bars show the costs of the Snowflake autho- 
rization and document authentication protocols de- 
scribed in Section 5.3. Notice that when public-key 
encryption operations are involved, both protocols 
require hundreds of milliseconds. When caching connection
information (Snowflake MAC protocol and
identical requests versus an SSL request), they require
tens of milliseconds. Snowflake’s cached requests are 
a factor of two slower than SSL requests, due in part 
to differences in the protocol, and in part to the slow 
libraries discussed in Section 7.4.3. 


                                        SSL    Snowflake
Minimum cost of HTTP GET                  5        5
  (C client and server)
Java+Jetty overhead for HTTP             20       20
Java SSL overhead                        22
S-expression parsing                             ≈20
SPKI object unmarshalling                        ≈20
Other Snowflake overhead                          17
  (proof verification,
  SPKI object marshalling)
MAC costs                                         28
  (serialization, MD5 hash)
Total                                    47      110


Table 1: Breakdown of time spent in MAC authorization 
protocol. Units are milliseconds. 
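
The totals in Table 1 can be checked by summing the components. The sketch below reads the first numeric column as the SSL case and the second as the Snowflake MAC case; that attribution is an interpretation based on the per-request figures the text quotes for each protocol, and the dictionary names are hypothetical.

```python
# Per-request cost components from Table 1, in milliseconds.
ssl_costs = {
    "minimum HTTP GET (C client and server)": 5,
    "Java+Jetty overhead for HTTP": 20,
    "Java SSL overhead": 22,
}
snowflake_costs = {
    "minimum HTTP GET (C client and server)": 5,
    "Java+Jetty overhead for HTTP": 20,
    "S-expression parsing": 20,        # approximate in the table
    "SPKI object unmarshalling": 20,   # approximate in the table
    "other Snowflake overhead": 17,
    "MAC costs": 28,
}

ssl_total = sum(ssl_costs.values())
snowflake_total = sum(snowflake_costs.values())
```

The sums come to 47 ms and 110 ms, matching the Total row.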


7.4 Observations 


We hypothesize that the Snowflake authorization 
model is not prohibitively expensive. In fact, be- 
cause it can subsume many hop-by-hop authoriza- 
tion models, it allows applications and users to make 
performance-security tradeoffs freely by selecting 
alternate hop-by-hop authorization protocols and 
plugging them into the same authorization frame- 
work. 

Do our measurements support our hypothesis? 
Unfortunately, since our implementation is unopti- 
mized and built on top of slow libraries, the num- 
bers do not support our hypothesis unequivocally. 
By comparing them with baseline experiments, how- 
ever, we believe we can make a strong case for the 
hypothesis. In the next two sections, we examine 
the two parts of our hypothesis. In Section 7.4.3, we 
argue that an optimized Snowflake promises to be 
competitive with existing hop-by-hop protocols. 


7.4.1 Comparable operations 


Snowflake-enhanced protocols are not inherently 
more expensive than other protocols with similar 
guarantees. The measurements displayed in Figure 8 
indicate that Snowflake performs similar encryption 
steps as SSL. SSL spends about 400 ms starting up, 
as does Snowflake. SSL can complete a request over 
an established channel in about 50 ms. With our 
MAC optimization, a Snowflake request takes about 
110 ms (see Table 1). 

Both SSL and Snowflake engage in similar operations.
SSL verifies message authenticity with
symmetric-key decryption and a CRC; Snowflake
does the same with an MD5 hash. Regardless of
protocol, the server parses and processes the request
and returns the reply. The SSL protocol checksums
and encrypts the reply; Snowflake securely hashes
the reply document. In both cases, the client uses
a corresponding operation to verify the reply. Because
the expensive cryptographic operations are
comparable, one expects optimized implementations
to perform comparably.

Figure 8: This graph displays the costs of standard SSL authentication (black bars) versus Snowflake client authorization
(gray bars) and server document authentication (white bars).

The additional sources of overhead in Snowflake 
are time spent walking the proof graph and memory 
consumed maintaining cached proofs. Our experi- 
ments do not explore that space in depth, but as we 
hint in Section 5.3.5, proofs are usually constructed 
incrementally while walking the name graph, an op- 
eration driven by the client user or application. 


7.4.2 The performance-security tradeoff


By comparing our authorized-request protocol to 
SSL we somewhat compare apples and oranges, for 
the protocols make different performance-security 
tradeoffs. For example, our protocol does not ver- 
ify the authenticity of the server’s reply header; 
since SSL provides integrity for the entire channel, 
a Snowflake-SSL protocol could as easily show the 
authenticity of all messages from the server. 

In fact, part of the purpose of our system is
to enable such tradeoffs. With Snowflake, one is
free to choose an established hop-by-hop protocol 
or to develop a new one. By stating in our logic 
the authorization promises the protocol makes, one 
can integrate the protocol into Snowflake’s end-to- 
end authorization model. Conceivably, new pro- 
tocols can be dynamically integrated into exist- 
ing Snowflake-aware applications; in other cases, a 
protocol-translating gateway can introduce the new 
protocol to the distributed system without hiding 
authorization information from the underlying ap- 
plication. 


7.4.3 Slow libraries 


Our formal measurements and informal tests indicate
that a large fraction of Snowflake’s cost is needless
overhead. Our baseline HTTP measurements
indicate that using Java and the convenient Jetty 
web server incurs substantial overhead (250%). Fur- 
thermore, our SSL measurements indicate that the 
Java encryption library Cryptix imposes a substan- 
tial bandwidth overhead. 

What surprised us most was the overhead of the
SPKI implementation on which we built Snowflake’s
objects. In informal tests, parsing a 2 KB S-expression
from a string takes around 20 ms, and
converting the resulting tree into typed Java objects
takes another 20 ms. There is no reason a well-implemented
library should spend milliseconds parsing
short strings in a simple language, and 40+ ms
delays such as these explain much of the difference
between Snowflake’s warm-connection performance
and that of simple HTTP transactions (see Figure 7).


8 Related work 


Our work is built primarily on the Logic of Authen- 
tication due to Abadi, Burrows, Lampson, Plotkin, 
and Wobber [1, 13, 25]. The Logic of Authentication
introduced the notion of conjunct and quoting
principals, and their applicability for modeling prac- 
tical mechanisms such as channels and multiplexed 
gateways. We have preserved the generality and for- 
mality of the Logic of Authentication while introduc- 
ing the crucial feature of restricted delegation. The 
structure of our implementation is similar to that of 
Taos, but we generally shift the burden of proof to 
the client so that the collection of access-control in- 
formation happens in the course of name resolution 
as described in Section 4.4. 

Sollins describes the restricted delegation problem 
as “cascaded authentication,” and proposes as a so- 
lution a restricted delegation mechanism called pass- 
ports [21] that provides for authorization of servers. 
Varadharajan et al. propose a more general mechanism
that incorporates both symmetric and asymmetric
encryption [23]. Neuman’s proxies are tokens
that express restricted delegation [17]. The Policy- 
Maker system has a notion of delegations with re- 
strictions specified by arbitrary code [5]. As we men- 
tion in Section 3, SPKI has a notion of restricted 
delegation close to the one we use. Because the only
principals in SPKI are public keys, it has high over- 
head for authorization on a single machine [9]. 

Sollins’ passports, Neuman’s proxies, PolicyMaker,
and SPKI certificates are mechanisms with
only informally-described semantics, and hence have 
no obvious and safe route to generalization. As we 
discuss in the companion paper [11], our formal se- 
mantics not only provides intuition for restricted del- 
egation and end-to-end authorization, but it can ad- 
vise us about the safety of possible extensions. Fur- 
thermore, it guides us in building a system with a 
minimal verification engine. 

Appel and Felten’s higher-order predicate logic is
similarly inspired and applicable to SPKI [3]. Because
our logic is a first-order propositional modal
logic, we can employ a conventional modal-logic
semantics [11]. Our logic is also simpler; we factor
implementation details out of the logic and leave only
the structure of authorization. For example, concepts
such as “digital signature” do not appear in
our proof rules; instead, we integrate them by mapping
a key to a logical principal, and asserting that
a digital signature check validates the logical statement
K says s.

Several single-machine operating systems have 
been built on the notion of restricted delegation; 
these are often called capability-based systems. Ca- 
pabilities in KeyKOS, Eros, and Mach are unforge- 
able because the kernel manages them. A process 
delegates its authorization by asking the kernel to 
pass a capability, possibly with restriction, to an- 
other process [6, 20, 4]. Amoeba capabilities, in 
contrast, are secret random numbers, and may be 
transmitted as raw data [22, 16]. Amoeba must as- 
sume that a cluster is a secure network; we con- 
sider such a cluster a single administrative domain. 
Snowflake end-to-end authorization could integrate 
either sort of capability implementation as a fast, 
local authorization mechanism. 


9 Summary and future work 


We make a case for end-to-end authorization. Our 
proposal is based on a formal logic that models re- 
stricted delegations and hence models several exist- 
ing hop-by-hop protocols. We describe the infras- 
tructure of Snowflake, our implementation, includ- 
ing two hop-by-hop protocols and applications that 
exploit its end-to-end nature. Our end-to-end ap- 
proach lets us connect systems with gateways that 
preserve authorization information, and by integrating
multiple hop-by-hop mechanisms, it gives us freedom
to easily trade off performance and security.
We would like to cross our work on end-to-end au- 
thorization with work on models of secrecy and in- 
formation flow, to work toward an end-to-end model 
that can capture notions of who should know what. 
In such an architecture we imagine a gateway that 
operates with only partial access to the information 
it translates, passing from server to client encrypted 
content that it need not view to accomplish its task.


Abstract


Self-securing storage prevents intruders from undetectably
tampering with or permanently deleting
stored data. To accomplish this, self-securing storage
devices internally audit all requests and keep
old versions of data for a window of time, regardless
of the commands received from potentially compromised
host operating systems. Within the window,
system administrators have this valuable information
for intrusion diagnosis and recovery. Our
implementation, called S4, combines log-structuring
with journal-based metadata to minimize the performance
costs of comprehensive versioning. Experiments
show that self-securing storage devices can
deliver performance that is comparable with conventional
storage systems. In addition, analyses indicate
that several weeks’ worth of all versions can reasonably
be kept on state-of-the-art disks, especially
when differencing and compression technologies are
employed.


1 Introduction 


Despite the best efforts of system designers and im- 
plementors, it has proven difficult to prevent com- 
puter security breaches. This fact is of growing im- 
portance as organizations find themselves increas- 
ingly dependent on wide-area networking (providing 
more potential sources of intrusions) and computer- 
maintained information (raising the significance of 
potential damage). A successful intruder can obtain 
the rights and identity of a legitimate user or admin- 
istrator. With these rights, it is possible to disrupt 
the system by accessing, modifying, or destroying 
critical data. 


Even after an intrusion has been detected and termi- 
nated, system administrators still face two difficult 
tasks: determining the damage caused by the intru- 
sion and restoring the system to a safe state. Dam- 
age includes compromised secrets, creation of back 
doors and Trojan horses, and tainting of stored data. 
Detecting each of these is made difficult by crafty in- 
truders who understand how to scrub audit logs and 


disrupt automated tamper detection systems. Sys- 
tem restoration involves identifying a clean backup 
(i.e., one created prior to the intrusion), reinitializ- 
ing the system, and restoring information from the 
backup. Such restoration often requires a signifi- 
cant amount of time, reduces the availability of the 
original system, and frequently causes loss of data 
created between the safe backup and the intrusion. 


Self-securing storage offers a partial solution to these 
problems by preventing intruders from undetectably 
tampering with or permanently deleting stored data. 
Since intruders can take the identity of real users and 
even the host OS, any resource controlled by the op- 
erating system is vulnerable, including the raw stor- 
age. Rather than acting as slaves to host OSes, self- 
securing storage devices view them, and their users, 
as questionable entities for which they work. These 
self-contained, self-controlled devices internally ver- 
sion all data and audit all requests for a guaranteed 
amount of time (e.g., a week or a month), thus pro- 
viding system administrators time to detect intru- 
sions. For intrusions detected within this window, 
all of the version and audit information is available 
for analysis and recovery. The critical difference be- 
tween self-securing storage and host-controlled ver- 
sioning (e.g., Elephant [29]) is that intruders can no 
longer bypass the versioning software by compromis- 
ing complex OSes or their poorly-protected user ac- 
counts. Instead, intruders must compromise single- 
purpose devices that export only a simple storage 
interface, and in some configurations, they may have 
to compromise both. 


This paper describes self-securing storage and our 
implementation of a self-securing storage server, 
called S4. A number of challenges arise when stor- 
age devices distrust their clients. Most importantly, 
it may be difficult to keep all versions of all data for
an extended period of time, and it is not acceptable 
to trust the client to specify what is important to 
keep. Fortunately, storage densities increase faster 
than most computer characteristics (100%+ per an- 
num in recent years). Analysis of recent workload 
studies [29, 34] suggests that it is possible to version
all data on modern 30-100GB drives for several
weeks. Further, aggressive compression and cross- 
version differencing techniques can extend the intru- 
sion detection window offered by self-securing stor- 
age devices. Other challenges include efficiently en- 
coding the many metadata changes, achieving secure 
administrative control, and dealing with denial-of- 
service attacks. 


The S4 system addresses these challenges with a 
new storage management structure. Specifically, S4
uses a log-structured object system for data ver- 
sions and a novel journal-based structure for meta- 
data versions. In addition to reducing space utiliza- 
tion, journal-based metadata simplifies background 
compaction and reorganization for blocks shared 
across many versions. Experiments with S4 show 
that the security and data survivability benefits of 
self-securing storage can be realized with reason- 
able performance. Specifically, the performance of 
S4-enhanced NFS is comparable to FreeBSD’s NFS 
for both micro-benchmarks and application bench- 
marks. The fundamental costs associated with self- 
securing storage degrade performance by less than 
13% relative to similar systems that provide no data 
protection guarantees. 


The remainder of this paper is organized as follows. 
Section 2 discusses intrusion survival and recovery 
difficulties in greater detail. Section 3 describes how 
self-securing storage addresses these issues, identi- 
fies some challenges inherent to self-securing storage, 
and discusses design solutions for addressing them. 
Section 4 describes the implementation of S4. Section 5
evaluates the performance and capacity overheads
of self-securing storage. Section 6 discusses
a number of issues related to self-securing storage. 
Section 7 discusses related work. Section 8 summa- 
rizes this paper’s contributions. 


2 Intrusion Diagnosis and Recovery 


Upon gaining access to a system, an intruder has 
several avenues of mischief. Most intruders attempt 
to destroy evidence of their presence by erasing or 
modifying system log files. Many intruders also in- 
stall back doors in the system, allowing them to gain 
access at will in the future. They may also install 
other software, read and modify sensitive files, or 
use the system as a platform for launching addi- 
tional attacks. Depending on the skill with which 
the intruders hide their presence, there will be some 
detection latency before the intrusion is discovered 
by an automated intrusion detection system (IDS) 
or by a suspicious user or administrator. During this
time, the intruders can continue their malicious activities
while users continue to use the system, thus
entangling legitimate changes with those of the in- 
truders. Once an intrusion has been detected and 
discontinued, the system administrator is left with 
two difficult tasks: diagnosis and recovery. 


Diagnosis is challenging because intruders can usu- 
ally compromise the “administrator” account on 
most operating systems, giving them full control
over all resources. In particular, this gives them 
the ability to manipulate everything stored on the 
system’s disks, including audit logs, file modifica- 
tion times, and tamper detection utilities. Recov- 
ery is difficult because diagnosis is difficult and be- 
cause user-convenience is an important issue. This 
section discusses intrusion diagnosis and recovery in 
greater detail, and the next section describes how 
self-securing storage addresses them. 


2.1 Diagnosis 


Intrusion diagnosis consists of three phases: detect- 
ing the intrusion, discovering what weaknesses were 
exploited (for future prevention), and determining 
what the intruder did. All are difficult when the 
intruder has free rein over storage and the OS.


Without the ability to protect storage from compro- 
mised operating systems, intrusion detection may 
be limited to alert users and system administrators 
noticing odd behavior. Examining the system logs 
is the most common approach to intrusion detec- 
tion [7], but when intruders can manipulate the log 
files, such an approach is not useful. Some intrusion 
detection systems also look for changes to important 
system files [16]. Such systems are vulnerable to in- 
truders that can change what the IDS thinks is a 
“safe” copy. 


Determining how an intruder compromised the sys- 
tem is often impossible in conventional systems, be- 
cause he will scrub the system logs. In addition, 
any exploit tools (utilities for compromising com- 
puter systems) that may have been stored on the 
target machine for use in multi-stage intrusions are 
usually deleted. The common “solutions” are to try 
to catch the intruder in the act or to hope that he 
forgot to delete his exploit tools. 


The last step in diagnosing an intrusion is to discover 
what was accessed and modified by the intruder. 
This is difficult, because file access and modifica- 
tion times can be changed and system log files can 
be doctored. In addition, checksum databases are 
of limited use, since they are effective only for static 
files.


2.2 Recovery


Because it is usually not possible to diagnose an 
intruder’s activities, full system recovery generally 
requires that the compromised machine be wiped 
clean and reinstalled from scratch. Prior to erasing 
the entire state of the system, users may insist that 
data, modified since the intrusion, be saved. The 
more effort that went into creating the changes, the 
more motivation there is to keep this data. Unfortu- 
nately, as the size and complexity of the data grows, 
the likelihood that tampering will go unnoticed in- 
creases. Foolproof assessment of the modified data 
is very difficult, and overlooked tampering may hide 
tainted information or a back door inserted by the 
intruder. 


Upon restoring the OS and any applications on the 
system, the administrator must identify a backup 
that was made prior to the intrusion; the most re- 
cent backup may not be usable. After restoring data 
from a pre-intrusion backup, the legitimately mod- 
ified data can be restored to the system, and users 
may resume using the system. This process often 
takes a considerable amount of time—time during 
which users are denied service. 


3 Self-Securing Storage 


Self-securing storage ensures information survival 
and auditing of all accesses by establishing a secu- 
rity perimeter around the storage device. Conven- 
tional storage devices are slaves to host operating 
systems, relying on them to protect users’ data. A 
self-securing storage device operates as an indepen- 
dent entity, tasked with the responsibility of not only 
storing data, but protecting it. This shift of stor- 
age security functionality into the storage device’s 
firmware allows data and audit information to be 
safeguarded in the presence of file server and client 
system intrusions. Even if the OSes of these sys- 
tems are compromised and an intruder is able to 
issue commands directly to the self-securing storage 
device, the new security perimeter remains intact. 


Behind the security perimeter, the storage device 
ensures data survival by keeping previous versions 
of the data. This history pool of old data versions, 
combined with the audit log of accesses, can be used 
to diagnose and recover from intrusions. This sec- 
tion discusses the benefits of self-securing storage 
and several core design issues that arise in realizing 
this type of device. 


3.1 Enabling intrusion survival 


Self-securing storage assists in intrusion recovery by 
allowing the administrator to view audit information 
and quickly restore modified or deleted files. The 
audit and version information also helps to diagnose 
intrusions and detect the propagation of maliciously 
modified data. 


Self-securing storage simplifies detection of an in- 
trusion since versioned system logs cannot be im- 
perceptibly altered. In addition, modified system 
executables are easily noticed. Because of this, self- 
securing storage makes conventional tamper detec- 
tion systems obsolete. 


Since the administrator has the complete picture of 
the system’s state, from intrusion until discovery, it 
is considerably easier to establish the method used 
to gain entry. For instance, the system logs would 
have normally been doctored, but by examining the 
versioned copies of the logs, the administrator can 
see any messages that were generated during the in- 
trusion and later removed. In addition, any exploit 
tools temporarily stored on the system can be recov- 
ered. 


Previous versions of system files, from before the 
intrusion, can be quickly and easily restored by res- 
urrecting them from the history pool. This prevents 
the need for a complete re-installation of the operat- 
ing system, and it does not rely on having a recent 
backup or up-to-date checksums (for tamper detec- 
tion) of system files. After such restoration, critical 
data can be incrementally recovered from the history 
pool. Additionally, by utilizing the storage device’s 
audit log, it is possible to assess which data might 
have been directly affected by the intruder. 


The data protection that self-securing storage pro- 
vides allows easy detection of modifications, selec- 
tive recovery of tampered files, prevention of data 
loss due to out-of-date backups, and speedy recov- 
ery since data need not be loaded from an off-line 
archive. 


3.2 Device security perimeter 


The device’s security model is what makes the abil- 
ity to keep old versions more than just a user con- 
venience. The security perimeter consists of self- 
contained software that exports only a simple stor- 
age interface to the outside world and verifies each 
command’s integrity before processing it. In con- 
trast, most file servers and client machines run a 
multitude of services that are susceptible to attack. 
Since the self-securing storage device is a single-function
device, the task of making it secure is much
easier; compromising its firmware is analogous to 
breaking into an IDE or SCSI disk. 


The actual protocol used to communicate with the 
storage device does not affect the data integrity that 
the new security perimeter provides. The choice of 
protocol does, however, affect the usefulness of the 
audit log in terms of the actions it can record and its 
correctness. For instance, the NFS protocol provides 
no authentication or integrity guarantees, therefore 
the audit log may not be able to accurately link 
a request with its originating client. Nonetheless, 
the principles of self-securing storage apply equally 
to “enhanced” disk drives, network-attached storage 
servers, and file servers. 


For network-attached storage devices (as opposed to 
devices attached directly to a single host system), 
the new security perimeter becomes more useful if 
the device can verify each access as coming from 
both a valid user and a valid client. Such verification 
allows the device to enforce access control decisions
and partially track propagation of tainted data. If 
clients and users are authenticated, accesses can be 
tracked to a single client machine, and the device’s 
audit log can yield the scope of direct damage from 
the intrusion of a given machine or user account. 


3.3 History pool management


The old versions of objects kept by the device com- 
prise the history pool. Every time an object is 
modified or deleted, the version that existed just 
prior to the modification becomes part of the his- 
tory pool. Eventually an old version will age and 
have its space reclaimed. Because clients cannot be 
trusted to demarcate versions consisting of multiple 
modifications, a separate version should be kept for 
every modification. This is in contrast to versioning 
file systems that generally create new versions only 
when a file is closed. 


A self-securing storage device guarantees a lower 
bound on the amount of time that a deprecated ob- 
ject remains in the history pool before it is reclaimed. 
During this window of time, the old version of the 
object can be completely restored by requesting that 
the drive copy forward the old version, thus mak- 
ing a new version. The guaranteed window of time 
during which an object can be restored is called the 
detection window. When determining the size of this 
window, the administrator must examine the trade- 
off between the detection latency provided by a large 
window and the extra disk space that is consumed 
by the proportionally larger history pool.


Although the capacity of disk drives is growing at
a remarkable rate, it is still finite, which poses two 
problems: 


1. Providing a reasonable detection window in ex- 
ceptionally busy systems. 


2. Dealing with malicious users that attempt to 
fill the history pool. (Note that space ex- 
haustion attacks are not unique to self-securing 
storage. However, device-managed versioning 
makes conventional user quotas ineffective for 
limiting them.) 


In a busy system, the amount of data written could 
make providing a reasonable detection window dif- 
ficult. Fortunately, the analysis in Section 5.2 sug- 
gests that multi-week detection windows can be pro- 
vided in many environments at a reasonable cost. 
Further, aggressive compression and differencing of 
old versions can significantly extend the detection 
window. 
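

The trade-off between history pool size and detection window can be sketched with simple arithmetic. The following is a rough illustration, not the analysis of Section 5.2: it assumes worst-case behavior (every written byte creates history), and the function name, the `space_reduction` factor, and the workload numbers are hypothetical.

```python
def detection_window_days(pool_bytes, write_bytes_per_day, space_reduction=1.0):
    """Estimate how many days of history fit in the history pool.

    Assumes the worst case: every byte written ends up in the pool.
    space_reduction (>= 1.0) is a hypothetical factor standing in for
    the effect of differencing and/or compression of old versions.
    """
    if write_bytes_per_day <= 0:
        raise ValueError("write_bytes_per_day must be positive")
    if space_reduction < 1.0:
        raise ValueError("space_reduction must be at least 1.0")
    return pool_bytes * space_reduction / write_bytes_per_day


MB = 1024 ** 2
GB = 1024 ** 3

# Hypothetical workload: 500 MB written per day, 10 GB history pool.
baseline = detection_window_days(10 * GB, 500 * MB)
with_differencing = detection_window_days(10 * GB, 500 * MB, space_reduction=2.0)
print(f"baseline: {baseline:.1f} days, with 2x reduction: {with_differencing:.1f} days")
```

Under these assumptions the window scales linearly with pool size and with any space reduction, and inversely with the daily write volume.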


Deliberate attempts to overflow the history pool can- 
not be prevented by simply increasing the space 
available. As with most denial of service attacks, 
there is no perfect solution. There are three flawed 
approaches to addressing this type of abuse. The 
first is to have the device reclaim the space held by 
the oldest objects when the history pool is full. Un- 
fortunately, this would allow an intruder to destroy 
information by causing its previous instances to be 
reclaimed from the overflowing history pool. The 
second flawed approach is to stop versioning objects 
when the history pool fills; although this will allow 
recovery of old data, system administrators would 
no longer be able to diagnose the actions of an in- 
truder or differentiate them from subsequent legiti- 
mate changes. The third flawed approach is for the 
drive to deny any action that would require addi- 
tional versions once the history pool fills; this would 
result in denial of service to all users (legitimate or 
not). 


Our hybrid approach to this problem is to try to pre- 
vent the history pool from being filled by detecting 
probable abuses and throttling the source machine’s 
accesses. This allows human intervention before
the system is forced to choose from the above poor 
alternatives. Selectively increasing latency and/or 
decreasing bandwidth allows well-behaved users to 
continue to use the system even while it is under at- 
tack. Experience will show how well this works in 
practice. 
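

One way to picture the throttling idea is a per-client delay that grows once a client's write rate exceeds a budget. This is only a sketch under assumed policy choices: the paper does not specify the abuse detector, and the class name, the budget, and the linear delay rule below are all hypothetical.

```python
class WriteThrottle:
    """Hypothetical per-client throttle: clients within budget see no
    added latency; clients over budget are delayed in proportion to
    how far over they are, up to a cap."""

    def __init__(self, budget_bytes_per_sec, max_delay_sec=2.0):
        self.budget = budget_bytes_per_sec
        self.max_delay = max_delay_sec
        self.usage = {}  # client id -> bytes written in the current interval

    def record_write(self, client, nbytes):
        self.usage[client] = self.usage.get(client, 0) + nbytes

    def delay_for(self, client, interval_sec=1.0):
        """Extra latency (seconds) to add to this client's next request."""
        rate = self.usage.get(client, 0) / interval_sec
        if rate <= self.budget:
            return 0.0
        overage = rate / self.budget - 1.0  # 0.0 means exactly at budget
        return min(self.max_delay, overage * 0.1)

    def end_interval(self):
        """Start a new measurement interval."""
        self.usage.clear()


throttle = WriteThrottle(budget_bytes_per_sec=1000)
throttle.record_write("well-behaved", 500)
throttle.record_write("suspect", 5000)
print(throttle.delay_for("well-behaved"), throttle.delay_for("suspect"))
```

Only the client that exceeds its budget is slowed, which matches the goal of letting well-behaved users continue while an administrator investigates.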


Since the history pool will be used for intrusion di- 
agnosis and recovery, not just recovering from acci-
dental destruction of data, it is difficult to construct
a safe algorithm that would save space in the his- 
tory pool by pruning versions within the detection 
window. Almost any algorithm that selectively re- 
moves versions has the potential to be abused by 
an intruder to cover his tracks and to successfully 
destroy/modify information during a break-in.


3.4 History pool access control 


The history pool contains a wealth of information 
about the system’s recent activity. This makes ac- 
cessing the history pool a sensitive operation, since 
it allows the resurrection of deleted and overwrit- 
ten objects. This is a standard problem posed by 
versioning file systems, but is exacerbated by the 
inability to selectively delete versions. 


There are two basic approaches that can be taken 
toward access control for the history pool. The first 
is to allow only a single administrative entity to have 
the power to view and restore items from the history 
pool. This could be useful in situations where the 
old data is considered to be highly sensitive. Having 
a single tightly-controlled key for accessing historical 
data decreases the likelihood of an intruder gaining 
access to it. Although this improves security, it pre- 
vents users from being able to recover from their 
own mistakes, thus consuming the administrator’s 
time to restore users’ files. The second approach 
is to allow users to recover their own old objects 
(in addition to the administrator). This provides 
the convenience of a user being able to recover their 
deleted data easily, but also allows an intruder, who 
obtains valid credentials for a given user, to recover 
that user’s old file versions. 


Our compromise is to allow users to selectively make 
this decision. By choice, a user could thus delete an 
object, version, or all versions from visibility by any- 
one other than the administrator, since permanent 
deletion of data via any other method than aging 
would be unsafe. This choice allows users to en- 
joy the benefits of versioning for presentations and 
source code, while preventing access to visible ver- 
sions of embarrassing images or unsent e-mail drafts. 


3.5 Administrative access 


A method for secure administrative access is needed 
for the necessary but dangerous commands that a 
self-securing storage device must support. Such 
commands include setting the guaranteed detection 
window, erasing parts of the history pool, and ac-
cessing data that users have marked as “unrecover-
able.” Such administrative access can be securely
granted in a number of ways, including physical ac-
cess (e.g., flipping a switch on the device) or well-
protected cryptographic keys.


Administrative access is not necessary for users at- 
tempting to recover their own files from accidents. 
Users’ accesses to the history pool should be han- 
dled with the same form of protection used for their 
normal accesses. This is acceptable for user activity, 
since all actions permitted for ordinary users can be 
audited and repaired. 


3.6 Version and administration tools 


Since self-securing storage devices store versions of 
raw data, users and administrators will need assis- 
tance in parsing the history pool. Tools for travers- 
ing the history must assist by bridging the gap be- 
tween standard file interfaces and the raw versions 
that are stored by the device. By being aware of 
both the versioning system and formats of the data 
objects, utilities can present interfaces similar to 
that of Elephant [29], with “time-enhanced” versions 
of standard utilities such as ls and cp. This is ac-
complished by extending the read interfaces of the
device to include an optional time parameter. When 
this parameter is specified, the drive returns data 
from the version of the object that was valid at the 
requested time. 
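

The effect of the optional time parameter can be illustrated with a small in-memory model. This is a minimal sketch, assuming each object keeps a time-ordered list of versions; the class and method names are hypothetical and do not reflect the device's actual on-disk structures.

```python
import bisect


class VersionedObject:
    """Hypothetical model of one object and its version history."""

    def __init__(self):
        self._times = []  # time each version was created, ascending
        self._data = []   # contents of each version; None marks a deletion

    def write(self, when, data):
        if self._times and when <= self._times[-1]:
            raise ValueError("versions must be added in increasing time order")
        self._times.append(when)
        self._data.append(data)

    def delete(self, when):
        self.write(when, None)

    def read(self, when=None):
        """Return the data valid at time `when` (latest version if omitted)."""
        if not self._times:
            raise KeyError("object has no versions")
        if when is None:
            idx = len(self._times) - 1
        else:
            # Index of the last version created at or before `when`.
            idx = bisect.bisect_right(self._times, when) - 1
            if idx < 0:
                raise KeyError("object did not exist at the requested time")
        data = self._data[idx]
        if data is None:
            raise KeyError("object was deleted at the requested time")
        return data


obj = VersionedObject()
obj.write(10, b"v1")
obj.write(20, b"v2")
obj.delete(30)
print(obj.read(15), obj.read(25))
```

A "time-enhanced" utility would simply pass the requested time through to such a read call; without the parameter, the read behaves as usual.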


In addition to providing a simple view of data ob- 
jects in isolation, intrusion diagnosis tools can utilize 
the audit log to provide an estimate of damage. For 
instance, it is possible to see all files and directo- 
ries that a client modified during the period of time 
that it was compromised. Further estimates of the 
propagation of data written by compromised clients 
are also possible, though imperfect. For example, 
diagnosis tools may be able to establish a link be- 
tween objects based on the fact that one was read 
just before another was written. Such a link between 
a source file and its corresponding object file would 
be useful if a user determines that a source file had 
been tampered with; in this situation, the object file 
should also be restored or removed. Exploration of 
such tools will be an important area of future work. 
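

The two kinds of estimate described above can be sketched over a list of audit records. This is an illustrative sketch only: the record layout, the lowercase operation names, and the `max_gap` heuristic are assumptions, and the read-then-write link is, as noted, imperfect.

```python
from collections import namedtuple

# Hypothetical audit record layout.
AuditRecord = namedtuple("AuditRecord", "time client user op object_id")

MODIFYING_OPS = {"create", "delete", "write", "append", "setattr", "setacl"}


def objects_modified_by(log, client, start, end):
    """Objects a client modified while it is believed to have been compromised."""
    return sorted({r.object_id for r in log
                   if r.client == client
                   and start <= r.time <= end
                   and r.op in MODIFYING_OPS})


def possibly_tainted(log, suspect_objects, max_gap):
    """Objects written by a client within max_gap time units after that
    client read a suspect object (a heuristic link, not proof)."""
    tainted = set()
    last_suspect_read = {}  # client -> time of its latest read of a suspect object
    for r in sorted(log, key=lambda rec: rec.time):
        if r.op == "read" and r.object_id in suspect_objects:
            last_suspect_read[r.client] = r.time
        elif r.op in ("create", "write", "append") and r.client in last_suspect_read:
            if r.time - last_suspect_read[r.client] <= max_gap:
                tainted.add(r.object_id)
    return sorted(tainted)


log = [
    AuditRecord(1, "c1", "alice", "write", 100),   # tampered source file
    AuditRecord(2, "c2", "bob", "write", 200),
    AuditRecord(3, "c1", "alice", "read", 100),    # source read by a build
    AuditRecord(4, "c1", "alice", "write", 101),   # object file written just after
    AuditRecord(50, "c1", "alice", "write", 102),  # unrelated later write
]
print(objects_modified_by(log, "c1", 0, 10))
print(possibly_tainted(log, {100}, max_gap=5))
```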


4 S4 Implementation 


S4 is a self-securing storage server that transpar- 
ently maintains an efficient object-versioning system 
for its clients. It aims to perform comparably with 
current systems, while providing the benefits of self- 
securing storage and minimizing the corresponding 
space explosion.


RPC Type         Allows Time-     Description
                 Based Access

Create           no               Create an object
Delete           no               Delete an object
Read             yes              (description illegible in source)
Write            no               Write data to an object
Append           no               Append data to the end of an object
GetAttr          yes              Get the attributes of an object (S4-specific and opaque)
SetAttr          no               Set the opaque attributes of an object
GetACLByUser     yes              Get an ACL entry for an object given a specific UserID
GetACLByIndex    yes              Get an ACL entry for an object by its index in the object’s ACL
(one row illegible in source)
SetACL           no               Set an ACL entry for an object
PCreate          no               Create a partition (associate a name with an ObjectID)
PDelete          no               Delete a partition (remove a name/ObjectID association)
PList            yes              List the partitions
PMount           yes              Retrieve the ObjectID given its name
Sync             (illegible)      Sync the entire cache to disk
Flush            not applicable   Removes all versions of all objects between two times
FlushO           not applicable   Removes all versions of an object between two times
(name illegible) not applicable   Adjusts the guaranteed detection window of the S4 device


Table 1: S4 Remote Procedure Call List — Operations that support time-based access accept a time in addition to the 
normal parameters; this time is used to find the appropriate version in the history pool. Note that all modifications create new
versions without affecting the previous versions.


4.1 A self-securing object store 


S4 is a network-attached object store with an in- 
terface similar to recent object-based disk propos- 
als [9, 24]. This interface simplifies access control 
and internal performance enhancement relative to a 
standard block interface. 


In S4, objects exist in a flat namespace managed 
by the “drive” (i.e., the object store). When ob- 
jects are created, they are given a unique identifier 
(ObjectID) by the drive, which is used by the client
for all future references to that object. Each object 
has an access control structure that specifies which 
entities (users and client machines) have permission 
to access the object. Objects also have metadata, file 
data, and opaque attribute space (for use by client 
file systems) associated with them. 


To enable persistent mount points, an S4 drive sup-
ports “named objects.” The object names are an
association of an arbitrary ASCII string with a par- 
ticular ObjectID. The table of named objects is im-
plemented as a special S4 object accessed through
dedicated partition manipulation RPC calls. This
table is versioned in the same manner as other ob- 
jects on the S4 drive. 


4.1.1 S4 RPC interface 


Table 1 lists the RPC commands supported by the 
S4 drive. The read-only commands (read, getattr, 
getacl, plist, and pmount) accept an optional time
parameter. When the time is provided, the drive 
performs the read request on the version of the ob- 
ject that was “most current” at the time specified, 
provided that the user making the request has suffi- 
cient privileges. 


The ACLs associated with objects have the tradi- 
tional set of flags, with one addition—the Recovery 
flag. The Recovery flag determines whether or not a 
given user may read (recover) an object version from 
the history pool once it is overwritten or deleted. 
When this flag is clear, only the device administrator 
may read this object version once it is pushed into
the history pool. The Recovery flag allows users to
decide the sensitivity of old versions on a file-by-file 
basis. 
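

The Recovery flag's role in access decisions can be sketched as follows. This is a simplified illustration: the `AclEntry` fields and the function name are hypothetical, and the rule that the administrator bypasses the ACL only for history-pool reads is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class AclEntry:
    """Hypothetical ACL entry: a read flag plus the Recovery flag."""
    user: str
    can_read: bool = False
    recovery: bool = False


def may_read_version(acl, user, is_current, is_admin=False):
    """Decide whether `user` may read a version of an object.

    acl maps user names to AclEntry. is_current is True for the live
    version and False for a version in the history pool.
    """
    if not is_current and is_admin:
        return True  # assumption: the administrator can always read history
    entry = acl.get(user)
    if entry is None or not entry.can_read:
        return False
    # History-pool versions additionally require the Recovery flag.
    return is_current or entry.recovery


acl = {
    "alice": AclEntry("alice", can_read=True, recovery=True),
    "bob": AclEntry("bob", can_read=True, recovery=False),
}
print(may_read_version(acl, "alice", is_current=False))
print(may_read_version(acl, "bob", is_current=False))
```

With the flag clear for "bob", his old versions remain in the history pool but are readable only by the administrator.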


4.1.2 S4/NFS translation 


Since one goal of self-securing storage is to provide 
an enhanced level of security and convenience on 
existing systems, the prototype minimizes changes 
to client systems. In keeping with this philosophy, 
the S4 drive is network-attached and an “S4 client”
daemon serves as a user-level file system transla-
tor (Figure 1a). The S4 client translates requests
from a file system on the target OS to S4-specific 
requests for objects. Because it runs as a user-level 
process, without operating system modifications, the 
S4 client should port to different systems easily. 


The S4 client currently has the ability to trans- 
late NFS version 2 requests to S4 requests. The 
S4 client appears to the local workstation as a NFS 
server. This emulated NFS server is mounted via 
the loopback interface to allow only that worksta- 
tion access to the S4 client. The client receives the 
NFS requests and translates them into S4 opera- 
tions. NFSv2 was chosen over version 3 because its 
client is well-supported within Linux, and its lack of 
write caching allows the drive to maintain a detailed 
account of client actions. 


Figure 1 shows two approaches to using the S4 client 
to serve NFS requests with the S4 drive. The first 
places the S4 client on the client system, as described 
previously, and uses the S4 drive as a network- 
attached storage device. The second incorporates 
the S4 client functionality into the server, as a NFS- 
to-S4 translator. This configuration acts as a S4- 
enhanced NFS server (Figure 1b) for normal file sys- 
tem activity, but recovery must still be accomplished 
through the S4 protocol since the NFS protocol has 
no notion of “time-based” access. 


The implementation of the NFS file system overlays 
files and directories on top of S4 objects. Objects
used as directories contain a list of ASCII filenames 
and their associated NFS file handles. Objects used 
as files and symlinks contain the corresponding data. 
The NFS attribute structure is maintained within 
the opaque attribute space of each object. 


When the S4 client receives a NFS request, the NFS 
file handle (previously constructed by the S4 client) 
can be directly hashed into the ObjectID of the di- 
rectory or file. The S4 client can then make requests 
directly to the drive for the desired data. 


[Figure 1 diagram not reproduced. Panels: (a) Baseline S4 (network-attached object store); (b) S4-enhanced NFS server. Recoverable labels: Client, Application, S4-NFS Translator, NFS.]


Figure 1: Two S4 Configurations — This figure shows two 
S4 configurations that provide self-securing storage via a NFS 
interface. (a) shows S4 as a network-attached object store 
with the S4 client daemon translating NFS requests to S4- 
specific RPCs. (b) shows a self-securing NFS server created 
by combining the NFS-to-S4 translation and the S4 drive. 


To support NFSv2 semantics, the client sends an ad- 
ditional RPC to the drive to flush buffered writes to 
the disk at the end of each NFS operation that mod- 
ifies the state of one or more objects. Since this RPC 
does not return until the synchronization is com- 
plete, NFSv2 semantics are supported even though 
the drive normally caches writes. 
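

The write-then-sync pattern can be sketched with a toy drive model. This is an illustration only: `SketchDrive` and `handle_nfs_write` are hypothetical names, and the real client issues these steps as RPCs to the drive rather than as local method calls.

```python
class SketchDrive:
    """Toy stand-in for the drive: writes land in a cache until sync()."""

    def __init__(self):
        self.cache = []  # buffered writes not yet on disk
        self.disk = []   # writes that have reached stable storage

    def write(self, object_id, offset, data):
        self.cache.append((object_id, offset, data))

    def sync(self):
        # Returns only after every buffered write is on "disk".
        self.disk.extend(self.cache)
        self.cache.clear()


def handle_nfs_write(drive, object_id, offset, data):
    """Translate one NFS write into a drive write followed by a sync,
    so the data is stable before the NFS reply is sent."""
    drive.write(object_id, offset, data)
    drive.sync()
    return len(data)


drive = SketchDrive()
written = handle_nfs_write(drive, object_id=7, offset=0, data=b"abc")
print(written, drive.disk)
```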


Because the client overlays a file system on top of 
the flat object namespace, some file system oper- 
ations require several drive operations (and hence 
RPC calls). These sets of operations are analogous 
to the operations that file systems must perform 
on block-based devices. To minimize the number 
of RPC calls necessary, the S4 client aggressively 
maintains attribute and directory caches (for reads 
only). The drive also supports batching of setattr, 
getattr, and sync operations with create, read, 
write, and append operations. 


4.2 S4 drive internals 


The main goals for the S4 drive implementation
are to avoid performance overhead and to minimize 
wasted space, while keeping all versions of all objects 
for a given period of time. Achieving these goals re-
quires a combination of known and novel techniques
for organizing on-disk data. 


4.2.1 Log-structuring for efficient writes 


Since data within the history pool cannot be over-
written, the S4 drive uses a log structure similar to
LFS [27]. This structure allows multiple data and 
metadata updates to be clustered into fewer, larger 
writes. Importantly, it also obviates any need to 
move previous versions before writing. 


In order to prune old versions and reclaim unused 
segments, S4 includes a background cleaner. While
the goal of this cleaner is similar to that of the LFS 
cleaner, the design must be slightly different. Specif- 
ically, deprecated objects cannot be reclaimed unless 
they have also aged out of the history pool. There-
fore, the S4 cleaner searches through the object map
for objects whose oldest version is older than the de-
tection window. Once a suitable object is found,
the cleaner permanently frees all data and meta- 
data older than the window. If this clears all of 
the resources within a segment, the segment can be 
marked as free and used as a fresh segment for fore- 
ground activity. 
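

The cleaner's rule (reclaim only what has aged out of the detection window, then free segments that become empty) can be sketched as follows. The data layout here is hypothetical: history is modeled as a map from object to (deprecation time, segment) pairs, which is far simpler than the drive's object map.

```python
def clean(history, live_segments, now, window):
    """Free history versions that have aged out of the detection window.

    history: dict object_id -> list of (deprecated_at, segment_id) pairs,
             one per old version (hypothetical layout).
    live_segments: set of segment ids that hold current data.
    Returns the set of segments that no longer hold any data.
    """
    cutoff = now - window
    touched = set()
    for versions in history.values():
        keep = []
        for deprecated_at, segment in versions:
            if deprecated_at < cutoff:
                touched.add(segment)  # aged out: reclaim this version
            else:
                keep.append((deprecated_at, segment))
        versions[:] = keep

    # A segment is free only if neither live data nor retained history uses it.
    still_used = set(live_segments)
    for versions in history.values():
        still_used.update(segment for _, segment in versions)
    return touched - still_used


history = {
    1: [(10, "s0"), (90, "s1")],
    2: [(20, "s0"), (30, "s2")],
}
freed = clean(history, live_segments={"s2"}, now=100, window=50)
print(freed, history)
```

In this example segment "s0" becomes free, while "s2" stays in use because it still holds live data and "s1" because its version is inside the window.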


4.2.2 Journal-based metadata 


To efficiently keep all versions of object metadata, 
S4 uses journal-based metadata, which replaces most 
instances of metadata with compact journal entries. 


Because clients are not trusted to notify S4 when 
objects are closed, every update creates a new ver- 
sion and thus new metadata. For example, when 
data pointed to by indirect blocks is modified, the 
indirect blocks must be versioned as well. In a con- 
ventional versioning system, a single update to a 
triple-indirect block could require four new blocks 
as well as a new inode. Early experiments with this 
type of versioning system showed that modifying a 
large file could cause up to a 4x growth in disk us- 
age. Conventional versioning file systems avoid this 
performance problem by only creating new versions 
when a file is closed. 


In order to significantly reduce these problems, S4 
encodes metadata changes in a journal that is main- 
tained for the duration of the detection window. By 
persistently keeping journal entries of all metadata 
changes, metadata writes can be safely delayed and 
coalesced, since individual inode and indirect block 
versions can be recreated from the journal. To avoid
rebuilding an object’s current state from the jour-
nal during normal operation, an object’s metadata
is checkpointed to a log segment before being evicted
from the cache. Unlike conventional journaling, such
checkpointing does not prune journal space; only ag-
ing may prune space. Figure 2 depicts the difference
in disk space usage between journal-based metadata
and conventional versioning when writing data to an
indirect data block.


Figure 2: Efficiency of Metadata Versioning — The
above figure compares metadata management in a conven-
tional versioning system to S4’s journal-based metadata ap-
proach. When writing to an indirect block, a conventional
versioning system allocates a new data block, a new indirect
block, and a new inode. Also, the identity of the new in-
ode must be recorded (e.g., in an Elephant-like inode log).
With journal-based metadata, a single journal entry suffices,
pointing to both the new and old data blocks.


In addition to the entries needed to describe meta- 
data changes, a checkpoint entry is needed. This 
checkpoint entry denotes writing a consistent copy 
of all of an object’s metadata to disk. It is necessary 
to have at least one checkpoint of an object’s meta- 
data on disk at all times, since this is the starting 
point for all time-based and crash recovery recre- 
ations. 


Storing an object’s changes within the log is done us- 
ing journal sectors. Each journal sector contains the 
packed journal entries that refer to a single object’s 
changes made within that segment. The sectors are 
identified by segment summary information. Jour- 
nal sectors are chained together backward in time to 
allow for version reconstruction. 
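

Version reconstruction from the journal can be sketched as an undo walk. This is a minimal model, assuming each journal entry records the old and new disk block for one logical block of an object; the `JournalEntry` layout and function name are hypothetical, and the real journal sectors also carry other metadata changes.

```python
from collections import namedtuple

# Hypothetical entry: at `time`, logical block `block_index` changed from
# `old_block` (None if it did not exist before) to `new_block`.
JournalEntry = namedtuple("JournalEntry", "time block_index old_block new_block")


def block_map_at(current_map, journal, when):
    """Rebuild an object's block map as of time `when`.

    current_map: dict block_index -> disk block (e.g., from a checkpoint).
    journal: entries for this object in increasing time order.
    Walks the journal backward, undoing every entry newer than `when`.
    """
    result = dict(current_map)
    for entry in reversed(journal):
        if entry.time <= when:
            break  # everything earlier is already reflected in the target version
        if entry.old_block is None:
            result.pop(entry.block_index, None)  # block did not exist yet
        else:
            result[entry.block_index] = entry.old_block
    return result


journal = [
    JournalEntry(10, 0, None, 100),  # block 0 first written
    JournalEntry(20, 1, None, 101),  # block 1 first written
    JournalEntry(30, 0, 100, 102),   # block 0 overwritten
]
current = {0: 102, 1: 101}
print(block_map_at(current, journal, when=25))
```

Because each entry points to both the old and the new block, no separate copy of the inode or indirect blocks is needed per version in this model.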


Journal-based metadata can also simplify cross- 
version differential compression [3]. Since the blocks 
changed between versions are noted within each en- 
try, it is easy to find the blocks that should be com-
pared. Once the differencing is complete, the old
blocks can be discarded, and the difference left in 
its place. For subsequent reads of old versions, the 
data for each block must be recreated as the en- 
tries are traversed. Cross-version differencing of old 
data will often be effective in reducing the amount 
of space used by old versions. Adding differencing 
technology into the S4 cleaner is an area of future 
work. 


4.2.3 Audit log 


In addition to maintaining previous object versions, 
S4 maintains an append-only audit log of all re- 
quests. This log is implemented as a reserved object 
within the drive that cannot be modified except by 
the drive itself. However, it can be read via RPC op- 
erations. The data written to the audit log includes 
command arguments as well as the originating client 
and user. All RPC operations (read, write, and ad- 
ministrative) are logged. Since the audit log may 
only be written by the drive front end, it need not 
be versioned, thus increasing space efficiency and de- 
creasing performance costs. 


5 Evaluation of self-securing storage 


This section evaluates the feasibility of self-securing 
storage. Experiments with S4 indicate that compre- 
hensive versioning and auditing can be performed 
without a significant performance impact. Also, es- 
timates of capacity growth, based on reported work- 
load characterizations, indicate that history win- 
dows of several weeks can easily be supported in 
several real environments. 


5.1 Performance 


The main performance goal for S4 is to be compara- 
ble to other networked file systems while offering en- 
hanced security features. This section demonstrates 
that this goal is achieved and also explores the over- 
heads specifically associated with self-securing stor- 
age features. 


5.1.1 Experimental Setup 


The four systems used in the experiments had the 
following configurations: (1) a S4 drive running 
on RedHat 6.1 Linux communicating with a Linux 
client over S4 RPC through the S4 client module 
(Figure 1a), (2) a S4-enhanced NFS server running
on RedHat 6.1 Linux communicating with a Linux
client over NFS (Figure 1b), (3) a FreeBSD 4.0 
server communicating with a Linux client over NFS, 
and (4) a RedHat 6.1 Linux server communicating 
with a Linux client over NFS. Since Linux NFS does 
not comply with the NFSv2 semantics of commit- 
ting data to stable storage before operation com- 
pletion, the Linux server’s file system was mounted 
synchronously to approximate NFS semantics. In all 
cases, NFS was configured to use 4KB read/write 
transfer sizes, the only option supported by Linux. 
The FreeBSD NFS configuration exports a BSD FFS 
file system, while the Linux NFS configuration ex- 
ports an ext2 file system. All experiments were run 
five times and have a standard deviation of less than 
3% of the mean. The S4 drives were configured with 
a 128MB buffer cache and a 32MB object cache. The 
Linux and FreeBSD NFS servers’ caches could grow 
to fill local memory (512MB). 


In all experiments, the client system has a 550MHz 
Pentium III, 128MB RAM, and a 3Com 3C905B 
100Mb network adapter. The servers have a 600MHz
Pentium III, 512MB RAM, a 9GB 10,000RPM Ul-
tra2 SCSI Seagate Cheetah drive, an Adaptec AIC-
7896/7 Ultra2 SCSI controller, and an Intel Ether-
Express Pro100 100Mb network adapter. The client
and server are on the same subnet and are connected 
by a 100Mb network switch. All versions of Linux 
use an unmodified 2.2.14 kernel, and the BSD sys- 
tem uses a stock FreeBSD 4.0 installation. 


To evaluate performance for common workloads, 
results from two application benchmarks are pre- 
sented: the PostMark benchmark [14] and the SSH- 
build benchmark [36]. These benchmarks crudely 
represent Internet server and software development 
workloads, respectively. 


PostMark was designed to measure the performance 
of a file system used for electronic mail, netnews, 
and web based services. It creates a large num- 
ber of small randomly-sized files (between 512B and 
9KB) and performs a specified number of transac- 
tions on them. Each transaction consists of two sub- 
transactions, with one being a create or delete and 
the other being a read or append. The default con- 
figuration used for the experiments consists of 20,000 
transactions on 5,000 files, and the biases for trans- 
action type are equal. 


The SSH-build benchmark was constructed as a 
replacement for the Andrew file system bench- 
mark [12]. It consists of 3 phases: The unpack phase, 
which unpacks the compressed tar archive of SSH 
v1.2.27 (approximately 1MB in size before decom-
pression), stresses metadata operations on files of
varying sizes. The configure phase consists of the
automatic generation of header files and Makefiles,
which involves building various small programs that
check the existing system configuration. The build
phase compiles, links, and removes temporary files.
This last phase is the most CPU intensive, but it
also generates a large number of object files and a
few executables.


Figure 3: PostMark Benchmark


5.1.2 Comparison of the servers


To gauge the overall performance of S4, the four 
systems described earlier were compared. As hoped, 
S4 performs comparably to the existing NFS servers. 


Figure 3 shows the results of the PostMark bench-
mark. The times for both the creation (time to cre-
ate the initial 5,000 files) and transaction phases of
PostMark are shown for each system. The S4 sys-
tems perform similarly to both the BSD and Linux
NFS servers, and slightly better because of their
log-structured layout.


The times of SSH-build’s three phases are shown 
in Figure 4. Performance is similar across the S4 
and BSD configurations. The superior performance 
of the Linux NFS server in the configure stage is due 
to a much lower number of write I/Os than in the 
BSD and S4 servers, apparently due to a flaw in the 
synchronous mount option under Linux.


Figure 4: SSH-build Benchmark


5.1.3 Overhead of the S4 cleaner


In addition to the more visible process of creating 
new versions, S4 must eventually garbage collect 
data that has expired from the history pool. This 
garbage collection comes at a cost. The potential 
overhead of the cleaner was measured by running 
the PostMark benchmark with 50,000 transactions 
on increasingly large sets of initial files. For each set 
of initial files, the benchmark was run once with the 
cleaner disabled and once with the cleaner compet- 
ing with foreground activity. 


The results shown in Figure 5 represent PostMark 
running with the initial set of files filling between 
2% and 90% of a 2GB disk. As expected, when the 
working set increases, performance of the normal S4 
system degrades due to increasingly poor cache and 
disk locality. The sharp drop in the graph from 2% 
to 10% is caused by the fact that the set of files 
and data expands beyond the bounds of the drive’s 
cache. 


Although the S4 cleaner is slightly different, it was 
expected to behave similarly to a standard LFS 
cleaner, which can reduce performance by up to ap-
proximately 34% [30]. The S4 cleaner is slightly
more intrusive, degrading performance by approxi- 
mately 50% in the worst case. The greater degra- 
dation is attributed mainly to the additional reads 
necessary when cleaning objects rather than seg- 
ments. In addition, the S4 cleaner has not been 
tuned and does not include known techniques for 
reducing cleaner performance problems [21].


Figure 5: Overhead of foreground cleaning in S4 —
This figure shows the transaction performance of S4 running 
the PostMark benchmark with varying capacity utilizations. 
The solid line shows system performance on a system without 
cleaning. The dashed line shows system performance in the 
presence of continuous foreground cleaner activity. 


5.1.4 Overhead of the S4 audit log 


In addition to versioning, self-securing storage de- 
vices keep an audit log of all connections and com- 
mands sent to the drive. Recording this audit log 
of events has some cost. In the worst case, all data 
written to the disk belongs to the audit log. In this 
case, one disk write is expected approximately ev- 
ery 750 operations. In the best case, large writes, 
the audit log overhead is almost non-existent, since 
the writes of the audit log blocks are hidden in 
the segment writes of the requests. For the macro- 
benchmarks, the performance penalty ranged be- 
tween 1% and 3%. 


For a more focused view of this overhead, a set of
micro-benchmarks was run with audit logging en-
abled and disabled. The micro-benchmarks proceed
in three phases: creation of 10,000 1KB files (split 
across 10 directories), reads of the newly created files 
in creation order, and deletion of the files in creation 
order. 


Figure 6 shows the results. The create and delete 
phases exhibit a 2.8% and 2.9% decrease in perfor- 
mance, respectively, and the read phase exhibits a 
7.2% decrease in performance. Read performance 
suffers a larger penalty because the audit log blocks 
become interwoven with the data blocks in the create 
phase. This reduces the number of files packed into 
each segment, which in turn increases the number of 
segment reads required.


Figure 6: Auditing Overhead in S4 — This figure shows
the impact on small file performance caused by auditing in- 
coming client requests. 


5.1.5 Fundamental performance costs 


There are three fundamental performance costs 
of self-securing storage: versioning, auditing, and 
garbage collection. Versioning can be achieved at 
virtually no cost by combining journal-based meta- 
data with the LFS structure. Auditing creates a 
small performance penalty of 1% to 3%, according 
to application benchmarks. The final performance 
cost, garbage collection, is more difficult to quantify. 
The extra overhead of S4 cleaning in comparison to 
standard LFS cleaning comes mainly from the dif- 
ference in utilized space due to the history pool. 


The worst-case performance penalty for garbage col- 
lection in S4 can be estimated by comparing the 
cleaning overhead at two space utilizations: the 
space utilized by the active set of objects and the 
space utilized by the active set combined with the 
history pool. For example, assume that the active 
set utilizes 60% of the drive’s space and the history 
pool another 20%. For PostMark, the cleaning over- 
head is the difference between cleaning performance 
and standard performance seen at a given space uti- 
lization in Figure 5. For 60% utilization, the clean- 
ing overhead is 43%. For 80% utilization, it is 53%. 
Thus, in this example, the extra cleaning overhead 
caused by keeping the history pool is 10%. 


There are several possibilities for reducing cleaner 
overhead for all space utilizations. With expected 
detection windows ranging into the hundreds of 
days, it is likely that the history pool can be ex- 
tended until such a time that the drive becomes idle. 
During idle time, the cleaner can run with no observable
overhead [2]. Also, recent research into technologies
such as freeblock scheduling offers standard
LFS cleaning at almost no cost [18]. This technique
could be extended for cleaning in S4.


Figure 7: Projected Detection Window. The expected
detection window that could be provided by utilizing 10GB
of a modern disk drive. This conservative history pool would
consume only 20% of a 50GB disk’s total capacity. The baseline
number represents the projected number of days worth of
history information that can be maintained within this 10GB
of space. The gray regions show the projected increase that
cross-version differencing would provide. The black regions
show the further increase expected from using compression in
addition to differencing.


5.2 Capacity Requirements 


To evaluate the size of the detection window that 
can be provided, three recent workload studies were 
examined. Figure 7 shows the results of approxi- 
mations based on worst-case write behavior. Spa- 
sojevic and Satyanarayanan’s AFS trace study [32] 
reports approximately 143MB per day of write traffic
per file server. The AFS study was conducted using
70 servers (consisting of 32,000 cells) distributed
across the wide area, containing a total of 200GB of 
data. Based on this study, using just 20% of a mod- 
ern 50GB disk would yield over 70 days of history 
data. Even if the writes consume 1GB per day per 
server, as was seen by Vogels’ Windows NT file us- 
age study [34], 10 days worth of history data can be 
provided. The NT study consisted of 45 machines 
split into personal, shared, and administrative do- 
mains running workloads of scientific processing, de- 
velopment, and other administrative tasks. Santry, 
et al. [29] report a write data rate of 110MB per
day. In this case, over 90 days of data could be 
kept. Their environment consisted of a single file 
system holding 15GB of data that was being used 
by a dozen researchers for development.


Much work has been done in evaluating the efficiency 
of differencing and compression [3, 4, 5]. To briefly 
explore the potential benefits for S4, its code base
was retrieved from the CVS repository at a single 
point each day for a week. After compiling the code, 
both differencing and differencing with compression 
were applied between each tree and its direct neigh- 
bor in time using Xdelta [19, 20]. After applying 
differencing, the space efficiency increased by 200%. 
Applying compression added an additional 200% for 
a total space efficiency of 500%. These results are in 
line with previous work. Applying these estimates to 
the above workloads indicates that a 10GB history 
pool can provide a detection window of between 50 
and 470 days. 
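

The detection-window figures above follow from dividing the history pool size by the daily write traffic. The sketch below reworks that arithmetic. It assumes 1GB = 1024MB and treats the combined differencing and compression gain as a flat 5x multiplier; both are simplifications for illustration, and the dictionary keys are labels chosen here, not names from the studies.

```python
# Worked version of the detection-window estimates.
# Assumptions (not from the paper): 1GB = 1024MB, and the 500% space
# efficiency is applied as a flat multiplier to every workload.

HISTORY_POOL_MB = 10 * 1024  # 10GB history pool (20% of a 50GB disk)

# Daily write traffic per server in MB, as quoted from the three studies.
WRITE_MB_PER_DAY = {
    "afs": 143,      # Spasojevic and Satyanarayanan [32]
    "nt": 1024,      # Vogels [34], 1GB per day
    "santry": 110,   # Santry, et al. [29]
}

SPACE_EFFICIENCY = 5.0  # differencing plus compression, 500% in total


def detection_window_days(pool_mb, write_mb_per_day, efficiency=1.0):
    """Days of history that fit in the pool at a given write rate."""
    return pool_mb * efficiency / write_mb_per_day


baseline = {name: detection_window_days(HISTORY_POOL_MB, rate)
            for name, rate in WRITE_MB_PER_DAY.items()}
improved = {name: detection_window_days(HISTORY_POOL_MB, rate, SPACE_EFFICIENCY)
            for name, rate in WRITE_MB_PER_DAY.items()}

for name in WRITE_MB_PER_DAY:
    print(f"{name}: {baseline[name]:.0f} days baseline, "
          f"{improved[name]:.0f} days with differencing and compression")
```

Under these assumptions the heaviest workload bounds the window from below and the lightest from above, which is how a single pool size yields a range of windows rather than one number.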


6 Discussion 


This section discusses several important implications 
of self-securing storage. 


Selective versioning: There are data that users 
would prefer not to have backed up at all. The com- 
mon approach to this is to store them in directories 
known to be skipped by the backup system. Since 
one of the goals of S4 is to allow recovery of exploit 
tools, it does not support designating objects as non- 
versioned. A system may be configured with non-S4 
partitions to support selective versioning. While this 
would provide a way to prevent versioning of tempo- 
rary files and other non-critical data, it would also 
create a location where an intruder could temporar- 
ily store exploit tools without fear that they will be 
recovered. 


Versioning vs. snapshots: Self-securing stor- 
age can be implemented with frequent copy-on-write 
snapshots [11, 12, 17] instead of versioning, so long 
as snapshots are kept for the full detection window. 
Although the audit log can still provide a record of 
what blocks are changed, snapshots often will not al- 
low administrators to recover short-lived files (e.g., 
exploit tools) or intermediate versions (e.g., system 
log file updates). Also, legitimate changes are only 
guaranteed to survive malicious activity if they sur- 
vive to the next snapshot time. Of course, the po- 
tential scope of such problems can be reduced by 
shrinking the time between snapshots. The comprehensive
versioning promoted in this paper represents the natural
end-point of such shrinking: every modification
creates a new snapshot.


Versioning file systems vs. self-securing stor- 
age: Versioning file systems excel at providing users 
with a safety net for recovery from accidents. They
maintain old file versions long after they would be 
reclaimed by the S4 system, but they provide lit- 
tle additional system security. This is because they 
rely on the host’s OS for security and aggressively 
prune apparently insignificant versions. By combin- 
ing self-securing storage with long-term landmark 
versioning [28], recovery from users’ accidents could 
be enhanced while also maintaining the benefits of 
intrusion survival. 


Self-securing storage for databases: Most 
databases log all changes in order to protect internal 
consistency in the face of system crashes. Some in- 
stitutions also retain these logs for long-term audit- 
ing purposes. All information needed to understand 
and recover from malicious behavior can be kept, in 
database-specific form, in these logs. Self-securing 
storage can increase the post-intrusion recoverabil- 
ity of database systems in two ways: (1) by prevent- 
ing undetectable tampering with stored log records, 
and (2) by preventing undetectable changes to data 
that bypass the log. After an intrusion, self-securing 
storage allows a database system to verify its log’s 
integrity and confirm that all changes are correctly 
reflected in the log—the database system can then 
safely use its log for subsequent recovery. 


Client-side cache effects: In order to improve ef- 
ficiency, most client systems use caches to minimize 
storage latencies. This is at odds with the desire 
to have storage devices audit users’ accesses and 
capture exploit tools. Client-side read caches hide 
data dependency information that would otherwise 
be available to the drive in the form of reads followed 
quickly by writes. However, this information could 
be provided by client systems as (questionable) hints 
during writes. Write caches cause a more serious 
problem when files are created then quickly deleted, 
thus never being sent to the drive. This could cause 
difficulties with capturing exploit tools, since they 
may never be written to the drive. Although client 
cache effects may obscure some of the activity in the 
client system, data that are stored on a self-securing 
storage device are still completely protected. 


Object-based vs. block-based storage: Imple- 
menting a self-securing storage device with a block 
interface adds several difficulties. Since objects are 
designed to contain one data item (file or directory), 
enforcing access control at this level is much more 
manageable than attempting to assign permissions 
on a per-block basis. In addition, maintaining ver- 
sions of objects as a whole, rather than having to col- 
lect and correlate individual blocks, simplifies recov- 
ery tools and internal reorganization mechanisms. 


Multi-device coordination: Multi-device coordi- 
nation is necessary for operations such as striping 
data or implementing RAID across multiple self- 
securing disks or file servers. In addition to the co- 
ordination necessary to ensure that multiple copies 
of data are synchronized, recovery operations must 
also coordinate old versions. On the other hand, 
clusters of self-securing storage devices could main- 
tain a single history pool and balance the load of 
versioning objects. Note that a self-securing storage 
device containing several disks (e.g., a self-securing 
disk array) does not have these issues. Additionally, 
it has the ability to keep old versions and current 
data on separate disks. 


7 Related Work 


Self-securing storage and S4 build on many ideas 
from previous work. Perhaps the clearest example is 
versioning: many versioned file systems have helped 
their users to recover from mistakes [22, 10]. Santry,
et al., provide a good discussion of techniques for 
traversing versions and deciding what to retain [29]. 
S4’s history pool corresponds to Elephant’s “keep 
all” policy (during its detection window), and it uses 
Elephant’s time-based access. The primary advan- 
tage of S4 over such systems is that it has been par- 
titioned from client operating systems. While this 
creates another layer of abstraction, it adds to the 
survivability of the storage. 


A self-securing disk drive would be another instance 
of many recent “smart disk” systems [1, 8, 15, 26, 
35]. All of these exploit the increasing computation 
power of such devices. Some also put these devices 
on networks and exploit an object-based interface. 
There is now an ANSI X3T10 (SCSI) working group 
looking to create a new standard for object-based 
storage devices. The S4 interface is similar to these. 


The standard method of intrusion recovery is to keep 
a periodic backup of files on trusted storage. Sev- 
eral file systems simplify this process by allowing a 
snapshot to be taken of a file system [11, 12, 17]. 
This snapshot can then be backed-up with standard 
file system tools. Spiralog [13] uses a log-structured 
file system to allow for backups to be made during 
system operation by simply recording the entire log 
to tertiary storage. While these systems are effective 
in preventing the loss of long-existing critical data, 
the window of time in which data can be destroyed 
or tampered with is much larger than in S4—often 
24 hours or more. Also, these systems are generally 
reliant upon a system administrator for operation, 
with a corresponding increase in cost and potential 
for human error. In addition, intrusion diagnosis is
extremely difficult in such systems. Permanent file 
storage [25] provides an unlimited set of puncture- 
proof backups over time. These systems are unlikely 
to become the first-line of storage because of lengthy 
access times. 


S4 borrows on-disk data structures from several sys- 
tems. Unlike Elephant’s FFS-like layout [23], the 
disk layout of S4 more closely resembles that of a 
log structured file system [27]. Many file systems 
use journaling to improve performance while main- 
taining disk consistency [6, 31, 33]. However, these 
systems delete the journal information once check- 
points ensure that the corresponding blocks are all 
on disk. S4’s journal-based metadata persistently
stores metadata versions in a space-efficient man- 
ner. 


8 Conclusions 


Self-securing storage ensures data and audit log sur- 
vival in the presence of successful intrusions and even 
compromised host operating systems. Experiments 
with the S4 prototype show that self-securing stor- 
age devices can achieve performance that is com- 
parable to existing storage appliances. In addition, 
analysis of recent workload studies suggests that complete
version histories can be kept for several weeks
on state-of-the-art disk drives. 


Abstract


Internet users increasingly rely on publicly avail- 
able data for everything from software installation 
to investment decisions. Unfortunately, the vast ma- 
jority of public content on the Internet comes with 
no integrity or authenticity guarantees. This paper 
presents the self-certifying read-only file system, a 
content distribution system providing secure, scal- 
able access to public, read-only data. 


The read-only file system makes the security of 
published content independent from that of the dis- 
tribution infrastructure. In a secure area (per- 
haps off-line), a publisher creates a digitally-signed 
database out of a file system’s contents. The pub- 
lisher then replicates the database on untrusted 
content-distribution servers, allowing for high avail- 
ability. The read-only file system protocol further- 
more pushes the cryptographic cost of content verifi- 
cation entirely onto clients, allowing servers to scale 
to a large number of clients. Measurements of an 
implementation show that an individual server run- 
ning on a 550 Mhz Pentium III with FreeBSD can 
support 1,012 connections per second and 300 con- 
current clients compiling a large software package. 


1 Introduction 


This paper presents the design and implementa- 
tion of a distributed file system that allows a large 
number of clients to access public, read-only data se- 
curely. Read-only data can have high performance, 
availability, and security needs. Some examples in-
clude executable binaries, popular software distribu- 
tions, bindings from hostnames to addresses or pub- 
lic keys, and popular, static Web pages. In many 
cases, people widely replicate and cache such data to 
improve performance and availability—for instance, 
volunteers often set up mirrors of popular operat- 
ing system distributions. Unfortunately, replication 
generally comes at the cost of security. Each replica 
adds a new opportunity for attackers to break in 
and tamper with data, or even for the replica’s own 
administrator to maliciously serve modified data. 


People have introduced a number of ad hoc mech- 
anisms for dealing with the security of public data, 
but these mechanisms often prove incomplete and of 
limited utility to other applications. For instance, 
binary distributions of Linux software packages in 
RPM [28] format can contain PGP signatures. How- 
ever, few people actually check these signatures, and 
packages cannot be revoked. In addition, when pack- 
ages depend on other packages being installed first, 
the dependencies cannot be made secure (e.g., one 
package cannot explicitly require another package to 
be signed by the same author). As another exam- 
ple, names of servers are typically bound to public 
keys through digitally signed certificates issued by a 
trusted authority. These certificates are distributed 
by the servers they authenticate, which naturally al- 
lows scaling to large numbers of servers. However, 
this approach also results in certificates having a 
long duration, which complicates revocation to the 
point that in practice many systems omit it. 


To distribute public, read-only data securely, we 
have built a high-performance, secure, read-only file 
system designed to be widely replicated on untrusted 
servers. We chose to build a file system because of 
the ease with which one can refer to the file names- 
pace in almost any context—from shell scripts to C 
code to a Web browser’s location field. Thus, the
file system can support a wide range of applications, 
such as certificate authorities, that one could not 
ordinarily implement using a network file system. 


Each read-only file system has a public key associated
with it. We use the naming scheme of SFS [17],
in which file names contain public keys. Thus, users 
can employ any of SFS’s various key management 
techniques to obtain the public keys of file systems. 


In our approach, an administrator creates a 
database of a file system’s contents and digitally 
signs it off-line using the file system’s private 
key. The administrator then widely replicates the 
database on untrusted machines. There, a simple 
and efficient server program serves the contents of 
the database to clients, without needing access to the 
file system’s private key. DNS round-robin schedul- 
ing or more advanced techniques can be used to dis- 
tribute the load between multiple replicas. A trusted 
program on the client machine checks the authentic- 
ity of data before returning it to the user. 


The read-only file system avoids performing any 
cryptographic operations on servers and keeps the 
overhead of cryptography low on clients. We ac- 
complish this with two simple techniques. First, 
blocks and inodes are named by handles, which are 
collision-resistant cryptographic hashes of their con- 
tents. Second, groups of handles are hashed recur- 
sively, producing a tree of hashes. Inodes contain 
the handles of a file’s blocks. Directory blocks con- 
tain lists of file name to handle bindings. Using the 
handle of the root inode of a file system, a client 
can verify the contents of any block by recursively 
checking hashes. 


The protocol between the client and server con- 
sists of only two remote procedure calls: one to fetch 
the signed handle for the root inode of a file system, 
and one to fetch the data (inode or file content) for 
a given handle. Since the server does not have to 
understand what it is serving, its implementation is 
both trivial and highly efficient: it simply looks up 
handles in the database and sends them back to the 
client. 


We named the file system presented in this paper 
the SFS read-only file system because it uses SFS’s
naming scheme and fits into the SFS framework. 
The server-side of the file system consists of two pro- 
grams: sfsrodb for creating signed databases off- 
line, and sfsrosd for serving signed databases from 
untrusted machines. The client-side software consists
of a daemon, sfsrocd, that queries databases
through sfsrosd processes, verifies the results, and 
translates them into a file system. 


A performance evaluation shows that sfsrosd can 
support 1012 short-lived connections per second on 
a PC (a 550 Mhz Pentium III with 256 Mbyte of 
memory) running FreeBSD, which is 26 times better 
than a standard read-write SFS file server and 92 
times better than a secure web server. In fact, the 
performance of the read-only server is limited mostly 
by the number of TCP connections per second, not 
by the overhead of cryptography, which is offloaded 
to clients. For applications like sustained downloads 
that require longer-lived connections, sfsrosd can 
support 300 concurrent sessions while still saturating 
a fast Ethernet. 


The rest of this paper is organized as follows. Sec- 
tion 2 relates our design to previous work. Section 3 
details the design of the read-only server. Section 4 
describes its implementation. Section 5 presents the 
applications of the read-only server. Section 6 evaluates
the performance of these applications and compares
them to existing approaches. Section 7 concludes.


2 Related Work 


We are unaware of a read-only (or read-write) file 
system that can support a high number of simultane- 
ous clients and provide strong security. Many sites 
use a separate file system to replicate and export 
read-only binaries, providing high availability and 
high performance. AFS supports read-only volumes 
to achieve replication [22]. However, in all these 
cases replicas are stored on trusted servers. Some 
file systems provide high security (e.g., the SFS read- 
write file system [17] or Echo [2]), but compared to 
the SFS read-only file system these servers do not 
scale well with the number of clients because their 
servers perform expensive cryptographic operations 
in the critical path (e.g., the SFS read-write server 
performs one private-key operation per client con- 
nection, which takes about 24 msec on a 550 Mhz 
Pentium III). 


Secure DNS [8] is an example of a read-only data 
service that provides security, high availability, and 
high performance. In secure DNS, each individual 
resource record is signed. This approach does not 
work for file systems. If each inode and 8 Kbyte
block of a moderate file system (for instance, the
635 Mbyte Red Hat 6.2 i386 distribution) had to
be signed individually with a 1,024-bit key, the signing
alone would take about 36 minutes (90,000 ×
24 msec) on a 550 Mhz Pentium III. A number of
read-only data services, such as FTP archives, are 
100 times bigger, making individual block signing 
impractical—particularly since we want to allow fre- 
quent updates of the database and rapid expiration 
of old signatures. 
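

The signing-cost estimate above can be reworked as a short calculation. The sketch below assumes 1 Mbyte = 1024 Kbytes. The paper does not state the inode count, so the rounded total of 90,000 signed items is taken as given rather than derived.

```python
# Reworking the per-block signing estimate.
# Assumption (not from the paper): 1 Mbyte = 1024 Kbytes.

DISTRIBUTION_MBYTES = 635   # Red Hat 6.2 i386 distribution
BLOCK_KBYTES = 8            # file system block size
SIGN_MSEC = 24              # one 1,024-bit signature on a 550 Mhz Pentium III
SIGNED_ITEMS = 90_000       # blocks plus inodes, rounded figure from the text

# Data blocks alone; inodes account for the rest of the 90,000 items.
data_blocks = DISTRIBUTION_MBYTES * 1024 // BLOCK_KBYTES

signing_minutes = SIGNED_ITEMS * SIGN_MSEC / 1000 / 60

# An archive 100 times bigger scales the cost linearly.
archive_hours = signing_minutes * 100 / 60

print(f"data blocks: {data_blocks}")
print(f"signing time: {signing_minutes:.0f} minutes")
print(f"100x archive: {archive_hours:.0f} hours")
```

The linear scaling is what makes per-block signatures impractical for large archives that are updated frequently.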


Secure HTTP servers are another example of 
servers that provide access to mostly read-only data. 
These servers are difficult to replicate on untrusted 
machines, however, since their private keys have to 
be on-line to prove their identity to clients. Further- 
more, private-key operations are expensive and are 
in the critical path: every SSL connection requires 
the server to compute modular exponentiations as 
part of the public-key cryptography [11]. As a re- 
sult, software-only secure Web servers achieve low 
throughput (with a 1,024-bit key, IE and Netscape 
servers can typically support around 15 connections 
per second). 


Content distribution networks built by companies 
such as Adero, Akamai, Cisco, and Digital Island are 
an efficient and highly-available way of distributing 
static Web content. Content stored on these net- 
works is dynamically replicated on trusted caches 
scattered around the Internet. Web browsers then 
connect to a cache that provides high performance. 
The approach described in this paper would allow 
read-only Web content to be replicated securely to 
untrusted machines and would provide strong data 
integrity to clients that run our software. For clients 
that don’t run our software, one can easily configure 
any Web server on an SFS client to serve the /sfs
directory, trivially creating a Web-to-SFS gateway 
for any Web clients that trust the server. 


Signed software distributions are common in the 
open-source community. In the Linux community, 
for example, a creator or distributor of a software 
package can sign RPM [28] files with PGP or GNU
GPG. RPM also supports MD5 hashes. A person 
downloading the software can optionally check the 
signature. Red Hat Software, for example, publishes 
their PGP public key on their Web site and signs all 
their software distributions with the corresponding 
private key. This setup provides some guarantees to 
the person who checks the signature on the RPM file 
and who makes sure that the public key indeed belongs
to Red Hat. However, RPMs do not provide
an expiration time or revocation support. If users
were running the SFS client software and RPMs were 
stored on SFS read-only file servers, the server would 
be authenticated transparently and the data would 
be checked transparently for integrity and recent- 
ness. 


The read-only file system makes extensive use of 
hash trees, which have appeared in numerous other 
systems. Merkle used a hierarchy of hashes for an 
efficient digital signature scheme [18]. In the con- 
text of file systems, the Byzantine-fault-tolerant file 
system uses hierarchical hashes for efficient state 
transfers between clients and replicas [5, 6]. The 
cryptographic storage file system [12] uses crypto- 
graphic hashes in a similar fashion to the SFS read- 
only file system. Duchamp uses hierarchical hashes 
to efficiently compare two file systems in a toolkit 
for partially-connected operation [7]. TDB [16] uses 
hash trees combined with a small amount of trusted 
storage to construct a trusted database system on
untrusted storage. Finally, a version of a network- 
attached storage device uses an incremental “Hash 
and MAC” scheme to reduce the cost of protecting 
the integrity of read traffic in storage devices that 
are unable to generate a MAC at full data transfer 
rates [14]. 


A number of proposals have been developed to 
make digital signatures cheaper to compute [13, 20],
some involving hash trees [27]. These proposals enable
signing hundreds of packets per second in applications
such as multicast streams. However, if
applied to file systems, these techniques would in- 
troduce complications such as increased signature 
size. Moreover, because the SFS read-only file sys- 
tem was designed to avoid trusting servers, read- 
only servers must function without access to a file 
system’s private key. This prevents any use of dy- 
namically computed digital signatures, regardless of 
the computational cost. 


3 SFS read-only file system


Figure 1 shows the overall architecture of the SFS 
read-only file system. In a secure area, an admin- 
istrator runs the SFS read-only database generator 
(sfsrodb), passing as arguments a directory of files 
to export and a file containing a private key. The 
administrator replicates this database on a number 
of untrusted machines, each of which runs a copy of 
the SFS read-only server daemon (sfsrosd). The
read-only server is a simple program that looks up
the data for a given handle in the replica’s database
and returns the data to the client.


Figure 1: The SFS read-only file system. Shaded boxes show the trusted computing base.


Most of the actual file system is implemented by 
the SFS read-only client daemon (sfsrocd), which 
runs on the client’s machine. It handles file sys- 
tem requests from the local operating system and 
responds to them. The read-only client understands 
the format of inodes and directories. It parses path- 
names, searches directories, looks up blocks of files, 
etc. 


In order to respond to requests from the local op- 
erating system, the client retrieves data from one of 
the servers that has a replica of the database. DNS 
round-robin scheduling or more advanced techniques 
(e.g., [15]) can be used to select a replica that pro- 
vides good performance. Since the replica may run 
on untrusted hardware, the client must verify that 
any data sent by the server was indeed signed by a 
database generator with the appropriate private key. 


The SFS read-only file system assumes that an at- 
tacker may compromise and assume control of any 
read-only server machine. It therefore cannot pre- 
vent denial-of-service from an attacker penetrating 
and shutting down every server for a given file system.
However, the client does ensure that any data
retrieved from a server is authentic, no older than a 
file system-configurable consistency period, and also 
no older than any previously retrieved data from the 
same file system. The read-only file system does not 
provide confidentiality. Thus, data on replicas does 
not have to be kept secret from attackers. The key
security property of the read-only file system is integrity.


Figure 2: Performance of base primitives on a 550 Mhz
Pentium III. Signing and verification use 1,024-bit
Rabin-Williams keys. The figure reports the cost (µsec)
of four operations: Sign 68 byte fsinfo, Verify 68 byte
fsinfo, SHA-1 256 byte iv+inode, and SHA-1 8,208 byte
iv+block.


Our design can also in principle be used to provide 
non-repudiation of file system contents. An admin- 
istrator of a server could commit to keeping every 
file system he ever signed. Then, clients could just 
record the signed root handle. The server would 
be required to prove what the file system contained 
on any previous day. In this way, an administrator
could never falsely deny that a file previously
existed. 


Figure 2 lists the cryptographic primitives that 
we use in the read-only file system. We chose the 
Rabin public key cryptosystem [26] for its fast signature
verification time. The implementation is secure
against chosen-message attacks (using the redundancy
function proposed in [1]). As can be seen
from Figure 2, computing digital signatures is somewhat
expensive, but verifying them takes only
82 µsec, far cheaper than a typical network round
trip time.


SFS also uses the SHA-1 [9] cryptographic hash 
function. SHA-1 is a collision-resistant hash function 
that produces a 20-byte output from an arbitrary-length
input. Finding any two inputs of SHA-1 that
produce the same output is believed to be computationally
intractable. Modern machines can typically
compute SHA-1 at a rate greater than the local area
network bandwidth. Thus, one can reasonably hash
the result of every RPC in a network file system protocol.


struct FSINFO {
    sfs_time start;
    unsigned duration;
    opaque iv[16];
    sfs_hash rootfh;
    sfs_hash fhdb;
};


Figure 3: Contents of the digitally signed root of an
SFS read-only file system.


The rest of this section describes how we use these 
primitives to efficiently provide authenticity and re- 
centness of data. 


3.1 SFS read-only Protocol 


The read-only protocol uses two RPCs: getfsinfo 
and getdata. getfsinfo takes no arguments and re- 
turns a digitally signed FSINFO structure, depicted 
in Figure 3. The SFS client verifies the signature 
using the public key embedded in the server’s name. 
The getdata RPC takes a 20-byte hash value as an 
argument and returns a data block producing that 
hash value. The client uses getdata to retrieve parts 
of the file system requested by the user, and veri- 
fies the authenticity of the blocks using the FSINFO 
structure. 


Because read-only file systems may reside on un- 
trusted servers, the protocol relies on time to enforce 
consistency loosely but securely. The start field of 
FSINFO indicates the time (in seconds since 1970) at 
which a file system was signed. Clients cache the 
highest value they have seen to prevent an attacker 
from rolling back the file system to a previous ver- 
sion. The duration field signifies the length of time 
for which the data structure should be considered 
valid. It represents a commitment on the part of a 
file system’s owner to issue a newly signed file sys- 
tem within a certain period of time. Clients reject 
an FSINFO structure when the current time exceeds 
start + duration. 
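

The two time-based checks described above, expiry and rollback protection, can be sketched as follows. This is a minimal illustration rather than the SFS client's code: the FsInfo and FreshnessChecker names are hypothetical, only the start and duration fields are modeled, and the signature on the structure is assumed to have been verified already.

```python
import time
from dataclasses import dataclass


@dataclass
class FsInfo:
    """Only the time-related fields of FSINFO are modeled here."""
    start: int     # signing time, in seconds since 1970
    duration: int  # validity period, in seconds


class FreshnessChecker:
    """Hypothetical helper tracking the newest start time seen for one file system."""

    def __init__(self):
        self.highest_start = 0

    def accept(self, info, now=None):
        """Return True if info is neither expired nor older than a cached version."""
        now = time.time() if now is None else now
        if now > info.start + info.duration:
            return False  # expired: the owner promised a newer signature by now
        if info.start < self.highest_start:
            return False  # rollback: the client has already seen a newer version
        self.highest_start = info.start
        return True


checker = FreshnessChecker()
print(checker.accept(FsInfo(start=1000, duration=100), now=1050))  # fresh
print(checker.accept(FsInfo(start=900, duration=500), now=1060))   # rolled back
print(checker.accept(FsInfo(start=1000, duration=100), now=1101))  # expired
```

Passing the current time explicitly, as the usage lines do, keeps the check easy to exercise; a real client would also need a reasonably accurate local clock for the expiry test to mean anything.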


The file system names arbitrary-length blocks of
data with fixed-size handles. The handle for a
data item x is computed using SHA-1: H(x) =
SHA-1(iv, x). iv, the initialization vector, is randomly
chosen by sfsrodb the first time the administrator
creates a database for a file system. It ensures
that simply knowing one particular collision of
SHA-1 will not immediately give attackers collisions
of functions actually used by SFS file systems.
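

The handle computation and the client-side check on getdata replies can be sketched as below. Two points are assumptions made for illustration: the iv and the data are combined by simple concatenation before hashing, and a Python dictionary stands in for the untrusted server's database. The function names are hypothetical.

```python
import hashlib
import os


def handle(iv: bytes, data: bytes) -> bytes:
    """H(x) = SHA-1(iv, x), assuming the iv is simply prepended to the data."""
    return hashlib.sha1(iv + data).digest()


def verified_getdata(db: dict, iv: bytes, h: bytes) -> bytes:
    """Fetch the block named by h and reject it if it does not hash back to h."""
    data = db[h]  # stand-in for the getdata RPC to an untrusted server
    if handle(iv, data) != h:
        raise ValueError("block does not match its handle")
    return data


iv = os.urandom(16)               # chosen once per file system
block = b"contents of one block"
h = handle(iv, block)
db = {h: block}                   # what a replica would store

print(len(h))                     # handles are 20 bytes long
print(verified_getdata(db, iv, h) == block)
```

Because the handle is derived from the content, a replica that substitutes different data for a handle is caught by the client without any signature check on the block itself.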


rootfh is the handle of the file system’s root di- 
rectory. It is a hash of the root directory’s inode 
structure, which through recursive use of H specifies
the contents of the entire file system, as described
below. fhdb is the hash of the root of a
tree that contains every handle reachable from the 
root directory. fhdb lets clients securely verify that a 
particular handle does not exist, so that they can re- 
turn stale file handle errors when file systems change. 
fhdb will not be necessary in future versions of the 
software, as described in Section 3.4. 


3.2 SFS read-only inode 


Figure 4 shows the format of an inode in the 
read-only file system. The inode begins with some 
metadata, including the file’s type (regular file, exe- 
cutable file, directory, opaque directory, or symbolic 
link), size, and modification time. Permissions are 
not included because they can be synthesized on the 
client. The inode then contains handles of succes- 
sive 8 Kbyte blocks of file data. If the file contains 
more than eight blocks, the inode contains the han- 
dle of an indirect block, which in turn contains han- 
dles of file blocks. Similarly, for larger files, an inode 
can also contain the handles of double- and triple- 
indirect blocks. In this way, the blocks of small files 
can be verified directly from the inode, while inodes 
can also indirectly verify large files—an approach 
similar to the on-disk data structures of the Unix 
File System [19]. 


3.3 Database generator


To export a file system, a system administrator 
produces a signed database from a source directory 
in an existing file system. The database contains 
file data blocks and inodes indexed by their hash 
values. In essence, it is analogous to a file system in 
which inode and block numbers have been replaced 
by cryptographic hashes. 


Figure 4: Format of a read-only file system inode.


The database generator utility traverses the given
file system depth-first to build the database. The
leaves of the file system tree are files or symbolic
links. For each regular file in a directory, the
database generator creates a read-only inode structure
and fills in the metadata. Then, it reads the
blocks of the file. For each block, sfsrodb hashes the
data in that block to compute its handle, and then
inserts the block into the database under the handle
(i.e., a lookup on the handle will return the block).
The hash value is also stored in the inode. When all
file blocks of a file are inserted into the database, the
filled-out inode is inserted into the database under
its hash value.
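
The per-file steps can be sketched as follows. This is an illustration under stated assumptions, not sfsrodb itself: a Python dict stands in for the database, repr() stands in for XDR marshaling, indirect blocks are ignored, and the names H and insert_file are hypothetical.

```python
import hashlib

BLOCK_SIZE = 8 * 1024  # 8 Kbyte file blocks, as in the text


def H(iv, data):
    """Handle of a data item, modeled as SHA-1 over iv || data."""
    return hashlib.sha1(iv + data).digest()


def insert_file(db, iv, content, metadata):
    """Insert one regular file into db and return the handle of its inode."""
    handles = []
    for offset in range(0, len(content), BLOCK_SIZE):
        block = content[offset:offset + BLOCK_SIZE]
        h = H(iv, block)
        db[h] = block        # a lookup on the handle returns the block
        handles.append(h)    # the hash value is also recorded in the inode
    # Stand-in for the marshaled inode: metadata plus the block handles.
    inode = repr((metadata, handles)).encode()
    inode_handle = H(iv, inode)
    db[inode_handle] = inode
    return inode_handle
```

Since blocks are stored under their hash, two identical blocks occupy a single database entry, whether they come from the same file or from different files.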


When all files in a given directory have been in- 
serted into the database, the generator utility inserts 
a file corresponding to the directory itself. The file 
blocks of a directory contain lists of (name, handle) 
pairs; the directory’s inode contains hashes of those 
blocks (and possibly indirect blocks) as for regu- 
lar files. Directories are sorted lexicographically by 
name. Thus, clients can avoid traversing the entire 
directory by performing a binary search when look- 
ing up files in very large directories (e.g., a directory 
that contains all names in the .com domain). This 
property also allows clients to verify inexpensively 
whether a file name exists or not, without having to 
read the whole directory. 
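
A minimal sketch of the lookup, assuming for simplicity that the sorted (name, handle) pairs of a directory are already in one in-memory list (a real client would fetch only the directory blocks the search touches). The function name lookup is hypothetical.

```python
import bisect


def lookup(entries, name):
    """Binary-search a list of (name, handle) pairs sorted by name.

    Returns the handle, or None if the name is absent.
    """
    # The 1-tuple (name,) sorts immediately before any (name, handle) pair,
    # so bisect_left lands on the matching entry if one exists.
    i = bisect.bisect_left(entries, (name,))
    if i < len(entries) and entries[i][0] == name:
        return entries[i][1]
    return None
```

A None result is as trustworthy as a hit: because the entries are sorted, the two neighbors of the insertion point show that the name is absent without reading the rest of the directory.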


Each directory also contains its full pathname
from the root of the file system. The client uses
the pathname to evaluate the file name “..” locally,
using it as a reference for any directory’s parent.
(Since a directory inode’s handle depends on the
handles of all subdirectories, a circular dependency
makes it impossible to create directory entries of the
form (“..”, parent’s handle).) Clients verify that a
directory contains the proper pathname when first
looking it up. This is not strictly necessary: an administrator
signing a bad database should expect
undefined interpretations by clients. However, the
sanity check reduces potentially confusing behavior
on clients of malformed file systems.


Inodes for symbolic links are slightly different 
from the one depicted in Figure 4. Instead of con- 
taining handles of blocks, the inode directly contains 
the destination path for the symbolic link. 


To avoid inconveniencing users with large direc- 
tories, server administrators can set the type field 
in an inode to “opaque directory.” When users 
list an opaque directory, they see only entries they 
have already referenced—somewhat like Unix “au- 
tomounter” directories [4]. Opaque directories are 
well-suited to giant directories containing, for instance,
all names in the .com domain or all name-to-key
bindings issued by a particular certificate
authority. If one used non-opaque directories for
these applications, users could inadvertently download
hundreds of megabytes of directory data by typing
ls or using file name completion in the wrong
directory.


After the whole directory tree has been inserted 
into the database, the generator utility fills out an 
FSINFO structure and signs it with the private key 
of the file system. For simplicity, sfsrodb stores 
the signed FSINFO structure in the database under 
a well-known, reserved key. 


The database generator stores inodes, directories,
and file block data in the database in XDR marshaled
form [24]. Using XDR has three advantages.
First, it simplifies the client implementation, as the
client can use the SFS RPC and crypto libraries to
parse file system data. Second, the XDR representation
clearly defines what the database contains,
which simplifies writing programs that process the
database (e.g., a debugging program). Finally, it improves
performance of the read-only server by saving
it from having to do any marshaling.


3.4 Updating file systems 


The biggest challenge in updating read-only file 
systems is dealing with data that no longer exists 
in the file system. When a file system changes, the 
administrator generates a new database and pushes 
it out to the server replicas. Files that persist across 
file system versions will keep the same handles. How- 
ever, when a file is removed or modified, clients can 
end up requesting handles no longer in the database. 
In this case, the read-only server replies with an er- 
ror. 


Unfortunately, since read-only servers (and the 
network) are not trusted, clients cannot necessar- 
ily believe “handle not found” errors they receive. 
Though a compromised server can hang a client by 
refusing to answer RPCs, it must not be able to make 
programs spuriously abort with stale file handle er- 
rors. Otherwise, for instance, an application looking 
up a key revocation certificate in a read-only file sys- 
tem might falsely believe that the certificate did not 
exist. 


We have two schemes to let clients securely deter- 
mine whether a given file handle exists: the current 
scheme uses the fhdb field of the FSINFO structure 
to verify that a handle no longer exists. fhdb is the 
root of a hash tree, the leaf nodes of which contain a 
sorted list of every handle in the file system. Thus, 
clients can easily walk the hash tree (using getfs- 
info) to see whether the database contains a given 
file handle. 


The fhdb scheme has advantages. It allows files 
to persist in the database even after they have been 
deleted, as not every handle in the database need be 
reachable from the root directory. Thus, by keeping 
handles of deleted files in a few subsequent revisions 
of a database, a system administrator can support 
the traditional Unix semantics that one can continue 
to access an open file even after it has been deleted. 


Unfortunately, fhdb has several drawbacks. Even
small changes to the file system cause most of the
hash tree under fhdb to change (making incremental
database updates unnecessarily expensive). Furthermore,
in the read-only file system, because handles
are based on file contents, there is no distinction be- 
tween modifying a file and deleting then recreating 
it. In some situations, one doesn’t want to have to 
close and reopen a file to see changes. (This is al- 
ways the case for directories, which therefore need 
a different mechanism anyway.) Finally, under the 
fhdb scheme, a server cannot change its iv without 
causing all open files to become stale on all clients. 


To avoid these problems, future versions of the 
software will eliminate fhdb. Instead, the client will 
track the pathnames of all files accessed in read- 
only file systems. When a server FSINFO structure 
is updated, the client will walk the file namespace 
to find the new inode corresponding to the name of 
each open file. Those who really want an open file 
never to change can still emulate the old semantics 
(albeit somewhat inelegantly) using a symbolic link 
to switch between the old and new version of a file 
while allowing both to exist simultaneously. Once 
clients track the pathnames of files, directories need 
no longer contain their full pathnames: clients will 
have enough state to evaluate the parent directory 
name “..” on their own. 


The read-only inode structure contains the mod- 
ification and “inode change” times of a file. Thus, 
sfsrodb could potentially update the database in- 
crementally after changes are made to the file sys- 
tem, recomputing only the hashes from changed files 
up to the root handle and the signature on the 
FSINFO structure. Our current implementation of 
sfsrodb creates a completely new database for each 
version of the file system, but we plan to support 
incremental updates in a future release. 


3.5 Incremental transfer 


We built a simple utility program, pulldb, that 
incrementally transfers a newer version of a database 
from one replica to another. The program fetches 
FSINFO from the source replica, and checks if the lo- 
cal copy of the database is out of date. If so, the 
program recursively traverses the entire file system, 
starting from the new root file handle, building on 
the side a list of all active handles. For each handle 
encountered, if the handle does not already exist in
the local database, pulldb fetches the corresponding
data with a getdata RPC and stores it in the database.
After the traversal, pulldb swaps the FSINFO structure
in the database and then deletes all handles no
longer in the file system. If a failure occurs before
the transfer is completed, the program can just be
restarted, since the whole operation is idempotent.
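
The transfer procedure can be sketched as below. This is an illustration, not the pulldb source: the local database is modeled as a dict, fetch_fsinfo and getdata stand in for the corresponding RPCs, children is a hypothetical helper that extracts the handles referenced by an inode, directory, or indirect block, and FSINFO is modeled as a dict with start and rootfh fields. Verification of fetched data is not shown.

```python
FSINFO_KEY = "FSINFO"  # stand-in for the well-known, reserved database key


def pulldb(local, fetch_fsinfo, getdata, children):
    """Bring `local` up to date; return True if anything had to be done."""
    new_info = fetch_fsinfo()
    old_info = local.get(FSINFO_KEY)
    if old_info is not None and old_info["start"] >= new_info["start"]:
        return False  # the local copy is not out of date

    # Traverse the new file system from its root, recording active handles
    # and fetching only the data the local database lacks.
    active = set()
    stack = [new_info["rootfh"]]
    while stack:
        h = stack.pop()
        if h in active:
            continue
        active.add(h)
        if h not in local:
            local[h] = getdata(h)
        stack.extend(children(local[h]))

    # Swap in the new FSINFO, then delete handles no longer in the file system.
    local[FSINFO_KEY] = new_info
    for h in [k for k in local if k != FSINFO_KEY and k not in active]:
        del local[h]
    return True
```

If the program is interrupted during the traversal, the old FSINFO is still in place and already-fetched data is simply found in the local database on the next run, which is why restarting is safe.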


3.6 Read-only server 


The server program sfsrosd is trivial: only 400
lines of C++. sfsrosd knows nothing about the
structure of the file system it serves; it simply gets 
requests for handles, looks up the handles in the 
database, and returns their values to the client. It 
also fields getfsinfo and SFS connect RPCs, to which 
it replies with two static data structures cached in 
memory. 


3.7 Read-only client 


The client program constitutes the bulk of the 
code in the read-only file system (1,500 lines of 
C++). The read-only client behaves like an NFS3 [3]
server, allowing it to communicate with the oper- 
ating system through ordinary networking system 
calls. The read-only client resolves pathnames for 
file name lookups and handles reads of files, directo- 
ries, and symbolic links. It relies on the server only 
for serving blocks of data, not for interpreting or ver- 
ifying those blocks. The client checks the validity of 
all blocks it receives against the hashes by which it 
requested them. 


We demonstrate how the client works by example.
Consider a user reading the file
/sfs/sfs.mit.edu:bzcc5hder7cuc86kf6qswyx6yuemnw69/README,
where bzcc5hder7cuc86kf6qswyx6yuemnw69
is the representation of the public key
of the server storing the file README. (In practice,
symbolic links save users from ever having to see or
type pathnames like this.)


The local operating system’s NFS client will call
into the protocol-independent SFS client software,
asking for the directory
/sfs/sfs.mit.edu:bzcc5hder7cuc86kf6qswyx6yuemnw69/.
The client will contact sfs.mit.edu, which will
respond that it implements the read-only file system
protocol. At that
point, the protocol-independent SFS client daemon
will pass the connection off to the read-only client,
which will subsequently be asked by the kernel to
interpret the file named README.


The client makes a getfsinfo RPC to the server
to get the file system’s signed FSINFO structure. It
verifies the signature on the structure, ensures that
the start field is no older than its previous value
if the client has seen this file system before, and
ensures that start + duration is in the future.


The client then obtains the root directory’s in- 
ode by doing a getdata RPC on the rootfh field of 
FSINFO. Given that inode, it looks up the file README 
by doing a binary search among the blocks of the di- 
rectory, which it retrieves through getdata calls on 
the block handles in the directory’s inode (and pos- 
sibly indirect blocks). When the client has the di- 
rectory entry (README, handle), it calls getdata on 
handle to obtain README’s inode. Finally, the client 
can retrieve the contents of README by calling get- 
data on the block handles in its inode. 
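
The walk from rootfh to the contents of README can be sketched as follows. This is a toy model under several assumptions: a dict plays the role of the untrusted server, JSON stands in for XDR marshaling, a directory fits in direct blocks, and a linear scan replaces the binary search for brevity. The names H, put, getdata, and read_file are hypothetical. The point is that every fetched item is checked against the handle by which it was requested.

```python
import hashlib
import json


def H(iv, data):
    """Handle of a data item, modeled as SHA-1 over iv || data (hex string)."""
    return hashlib.sha1(iv + data).hexdigest()


def put(db, iv, obj):
    """Server side: store a marshaled object under its handle."""
    data = json.dumps(obj, sort_keys=True).encode()
    handle = H(iv, data)
    db[handle] = data
    return handle


def getdata(db, iv, handle):
    """Client side: fetch by handle and verify before interpreting."""
    data = db[handle]
    if H(iv, data) != handle:
        raise ValueError("data does not match the requested handle")
    return json.loads(data)


def read_file(db, iv, rootfh, name):
    root = getdata(db, iv, rootfh)           # root directory inode
    entries = []
    for block_handle in root["blocks"]:      # directory blocks of [name, handle] pairs
        entries.extend(getdata(db, iv, block_handle))
    inode = getdata(db, iv, dict(entries)[name])
    return "".join(getdata(db, iv, bh) for bh in inode["blocks"])
```

Only rootfh needs to come from the signed FSINFO structure; every other handle is obtained from data that has already been verified.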


4 Implementation 


As illustrated in Figure 5, the read-only file sys- 
tem is implemented as two new daemons (sfsrocd 
and sfsrosd) in the SFS system [17]. sfsrodb is a 
stand-alone program. 


sfsrocd and sfsrosd communicate with Sun 
RPC over a TCP connection. (The exact message 
formats are described in the XDR protocol descrip- 
tion language [24].) We also use XDR to define cryp- 
tographic operations. Any data that the read-only 
file system hashes or signs is defined as an XDR 
data structure; SFS computes the hash or signature 
on the raw, marshaled bytes. 


sfsrocd, sfsrosd, and sfsrodb are written in 
C++. To handle many connections simultaneously, 
the client and server use SFS’s asynchronous RPC 
library. Both programs are single-threaded, but the 
RPC library allows the client to have many out- 
standing RPCs. 


Because of SFS’s support for developing new 
servers and sfsrosd’s simplicity, the implemen- 
tation of sfsrosd is trivial. It gets requests 
for data blocks by file handle, looks up pre- 
formatted responses in a B-tree, and responds to the 
client. The current implementation uses the Sleepycat
database’s B-tree [23]. In the measurements,
sfsrosd accesses the database synchronously.


The implementations of the other two programs
(sfsrodb and sfsrocd) are more interesting; we discuss
them in more detail.


Figure 5: Implementation overview of the read-only file system in the SFS framework.


4.1 sfsrodb 


The implementation of sfsrodb is simple. It is 
a short stand-alone C++ program. It walks down 
the given file system and builds a database with file 
system data indexed by the cryptographic hash of 
the data. After building the database, it creates an 
FSINFO structure and signs it with the private key 
of the file system. 


The disadvantage of storing the data in marshaled
form is that the physical representation of the data is
slightly larger than the actual data. For instance, 
an 8 Kbyte file block is slightly larger than 8 Kbyte. 
The Sleepycat database does not support values just 
over a power of 2 in size very well; we are develop- 
ing a light-weight, asynchronous B-tree that handles 
such odd-sized values well. 


A benefit of storing blocks under their hash is that 
blocks from different files that have the same hash 
will only be stored once in the database. If a file
system contains blocks with identical content among
multiple files, then sfsrodb stores just one block under
the hash. In the RedHat 6.2 distribution, 5,253
out of 80,508 file data blocks share their hash with 
another block. The overlap is much greater if one 
makes the same data available in two different for- 
mats (for instance, the contents of the RedHat 6.2 
distribution, and the image of a CD-ROM contain- 
ing that distribution). 


4.2 sfsrocd 


sfsrocd implements four caches with LRU re- 
placement policies to improve performance by avoid- 
ing RPCs to sfsrosd. It maintains an inode cache, 
an indirect-block cache, a small file-block cache, and 
a cache for directory entries. 


sfsrocd’s small file-block cache primarily opti- 
mizes the case of the same block appearing in mul- 
tiple files. In general, sfsrocd relies on the local 
operating system’s buffer cache to cache the file contents.
Thus, any additional caching of file contents
will tend to waste memory unless a block appears in
multiple places. The small block cache optimizes
common cases—such as a file with many blocks of 
all zeros—without dedicating too much memory to 
redundant caching. 


Indirect blocks are cached so that sfsrocd can 
quickly fetch and verify multiple blocks from a large 
file without refetching the indirect blocks. sfsrocd 
does not prefetch because most operating systems 
already implement prefetching locally. 
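
A cache with LRU replacement, keyed by handle, might look like the following sketch. It is not the sfsrocd code; the class name and interface are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Fixed-capacity cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, handle):
        if handle not in self._items:
            return None                    # miss: caller issues a getdata RPC
        self._items.move_to_end(handle)    # mark as most recently used
        return self._items[handle]

    def put(self, handle, value):
        self._items[handle] = value
        self._items.move_to_end(handle)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used
```

Keying the file-block cache by handle is what lets a block shared by several files, or repeated within one file, be served from a single cached copy.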


5 Applications 


To demonstrate the usefulness of the SFS read- 
only file system, we describe two applications that 
we measure in Section 6: certificate authorities and 
software distribution. 


5.1 Certificate Authorities 


Certificate authorities for the Internet are servers 
that publish certificates binding hostnames to public 
keys. On the Web, for instance, the certificate au- 
thority Verisign certifies server keys for Web servers. 
Verisign signs the domain name and the public key of 
the Web server in an X.509 certificate, and returns 
this to the Web server administrator [10]. When 
a browser connects to the Web server with secure 
HTTP, the server responds with the certificate. The 
browser checks the validity of the certificate by verifying
it with Verisign’s public key. Most popular
browsers have Verisign’s key embedded in their binaries.
One benefit of this approach is that Verisign
does not have to be on-line when the browser con- 
nects to a certified Web server. However, this comes 
at the cost of complicating certificate revocation to 
the point that in practice no one does it. 


In contrast, SFS uses file systems to certify public 
keys of servers. SFS certificate authorities are nothing
more than ordinary file systems serving symbolic
links that translate human-readable names into
public keys that name file servers [17]. For ex- 
ample, if Verisign acted as an SFS certificate au- 
thority, client administrators would likely create 
symbolic links from their local disks, for instance 
/verisign, to Verisign’s self-certifying pathname— 
a pathname containing the public key of Verisign’s 
file system. This file system would in turn contain 
symbolic links to other SFS file systems. For exam- 
ple, /verisign/NYU might be a symbolic link to a 
self-certifying pathname for an SFS file server that 
Verisign calls NYU. 


Unlike traditional certificate authorities, SFS cer- 
tificate authorities get queried interactively. This 
simplifies certificate revocation, since revoking a key 
amounts to removing the symbolic link. However, 
it also places high integrity, availability, and perfor- 
mance demands on file systems serving as on-line 
certificate authorities. 


By running certificate authorities as SFS read- 
only file systems, we can address these needs. The 
SFS read-only file system improves performance by 
making the amount of cryptographic computation 
proportional to the file system’s size and rate of 
change, rather than to the number of clients con- 
necting. SFS read-only also improves integrity by 
freeing SFS certificate authorities from the need to 
keep any on-line copies of their private keys. Fi- 
nally, SFS read-only improves availability because it 
can be replicated on untrusted machines. 


An administrator adds certificates to its SFS file 
system by adding new symbolic links. The database 
is updated once a day, similarly to second-level DNS 
updates. The administrator (incrementally) repli- 
cates the database to other servers. 


The certificate authority database (and thus its 
certificates) might be valid for one day. The cer- 
tificate that we bought from Verisign for our Web 
server is valid for 12 months. If the private key of 
an SFS server is compromised, then the next day the 
certificate will be out of the on-line database. 


4th Symposium on Operating Systems Design and Implementation 


SFS certificate authorities also support key revo- 
cation certificates to revoke public keys of servers 
explicitly. The key revocation certificates are self- 
authenticating [17] and signed with the private key 
of the compromised server. Verisign could, for ex- 
ample, maintain an SFS certificate authority that 
has a directory to which users upload revocation 
certificates for some fee; since the certificates are 
self-authenticating, Verisign does not have to cer- 
tify them. Clients check this directory when they 
perform an on-line check for key certificates. Be- 
cause checks can be performed interactively, this ap- 
proach works better than X.509 certificate revoca- 
tion lists [10]. 


5.2 Software Distribution 


Sites distributing popular software have high 
availability, integrity, and performance needs. Open 
software is often replicated at several mirrors to sup- 
port a high number of concurrent downloads. If 
users download a distribution with anonymous FTP, 
they have low data integrity: a user cannot tell 
whether he is downloading a trojan-horse version in- 
stead of the correct one. If users connect through 
the Secure Shell (SSH) or secure HTTP, then the 
server’s throughput is low because of cryptographic 
operations it must perform. Furthermore, that solu- 
tion doesn’t protect against attacks where the server 
is compromised and the attacker replaces a program 
on the server’s disk with a trojan horse. 


By distributing software through SFS read-only 
servers, one can provide integrity, performance, and 
high availability. Users with sfsrocd can even 
browse the distribution as a regular file system and 
compile the software straight from the sources stored 
on the SFS file system. sfsrocd will transparently 
check the authenticity of the file system data. To 
distribute new versions of the software, the admin- 
istrator simply updates the database. Users with 
only a browser could get all the benefits by just con- 
necting through a Web-to-SFS proxy to the SFS file 
system. 


Software distribution using the read-only file sys- 
tems complements signed RPMs. First, RPMs do 
not provide any revocation support; the signature 
on an RPM is good forever. Second, there is no easy 
way to determine whether an RPM is recent; an at- 
tacker can give a user an older version of a software 
package without the user knowing it. Third, there 
is no easy method for signing a collection of RPMs
that constitute a single system. For example,
there is currently no way of cryptographically verifying
that one has the complete Linux RedHat 6.2
distribution (or all necessary security patches for a 
release). Using SFS read-only, Red Hat could se- 
curely distribute the complete 6.2 release, providing 
essentially the same security guarantees as a physical 
CDROM distribution. 


6 Performance 


This section presents the results of measurements 
to support the claims that (1) SFS read-only pro- 
vides acceptable application performance and (2) 
SFS read-only scales well with the number of clients. 


To support the first claim, we measure the per- 
formance of microbenchmarks and a large software 
compilation. We compare the performance of the 
SFS read-only file system with the performance on 
the local file system, insecure NFS, and the secure 
SFS read-write file system. 


To support the second claim, we measure the max- 
imum number of connections per server and the 
throughput of software downloads with an increas- 
ing number of clients. 


We expect that the main factors affecting SFS 
read-only performance are the user-level implemen- 
tation in the client, hash verification in the client, 
and database lookups on the server. 


6.1 Experimental setup 


We measured performance on 550 MHz Pen- 
tium IIIs running FreeBSD 3.3. The client and 
server were connected by 100 Mbit, full-duplex, 
switched Ethernet. Each machine had a 100 Mbit 
Tulip Ethernet card, 256 Mbytes of memory, and an 
IBM 18ES 9 Gigabyte SCSI disk. In sfsrocd, the in- 
ode, indirect-block, and directory entry caches each 
have a maximum of 512 entries, while the file-block 
cache has maximum of 64 entries. Maximum TCP 
throughput between client and server, as measured 
by ttcp [25], was 11.31 Mbyte/sec. 


Because the certificate authority benchmark in
Section 6.4 requires many CPU cycles on the client,
we also employed two 700 MHz Athlons running
OpenBSD 2.7. Each Athlon had a 100 Mbit Tulip
Ethernet card and 128 Mbytes of memory. Maximum
TCP throughput between an Athlon and
the FreeBSD server, as measured by ttcp, was
11.04 Mbyte/sec. The Athlon machines generated
the client SSL and SFSRW requests; we report the
sum of the performance measured on the two machines.


Figure 6: Time to sequentially read 1,000 1 Kbyte
files. Local is FreeBSD’s local FFS file system on the
server. The local file system was tested with a cold
cache. The network tests were applied to warmed
server caches, but cold client caches. RW, RO, and
RONV denote respectively the read-write protocol,
the read-only protocol, and the read-only protocol
with no verification.


For all experiments we report the average of five 
runs. 


6.2 Microbenchmarks


To evaluate the performance of the SFS read-only 
system, we perform small and large file microbench- 
marks. 


6.2.1 Small file benchmark 


We use the read phases of the LFS benchmarks [21]
to obtain a basic understanding of single
client/single server performance. Figure 6 shows
the latency of sequentially reading 1,000 1 Kbyte
files on the different file systems. The files contain
random data and are distributed evenly across ten
directories. For the read-only and NFS experiments,
all samples were within 0.4% of the average. For the
read-write experiment, all samples were within 2.7%
of the average. For the local file system, all samples
were within 6.9% (0.4 seconds) of the average.


                               Cost (sec)
NFS loopback                   0.661
Computation in client          1.386
Communication with server      0.507

Table 1: Breakdown of SFS read-only performance
reported in Figure 6.


As expected, the SFS read-only server performs 
better than the SFS read-write server (2.43 vs. 3.27 
seconds). The read-only file server performs worse 
than NFSv3 over TCP (2.43 vs. 1.14 seconds). To 
understand the performance of the read-only file 
server, we break down the 2.43 seconds spent in the 
read-only client (see Table 1). 


To measure the cost of the user-level implementa- 
tion we measured the time spent in NFS loopback. 
We used the fchown operation against a file in a 
read-only file system to measure the time spent in 
the user-level NFS loopback file system. This oper- 
ation generates NFS RPCs from the kernel to the 
read-only client, but no traffic between the client 
and the server. The average over 1000 fchown operations
is 167 μsec. By contrast, the average for
attempting an fchown of a local file with permission
denied is 2.4 μsec. The small file benchmark generates
4015 NFS loopback RPCs. Hence, the overhead
of the client’s user-level implementation is at least
(167 μsec - 2.4 μsec) * 4015 = 0.661 seconds.


We also measured the CPU time spent during the 
small file benchmark in the read-only client at 1.386 
seconds. With verification disabled, this drops to 
1.300 seconds, indicating that for this workload, file 
handle verification consumes very little CPU time. 


To measure the time spent communicating with 
the read-only server, we timed the playback of a 
trace of the 2101 getdata RPCs of the benchmark 
to the read-only server. This took 0.507 seconds. 


Figure 7: Throughput of sequential and random
reads of a 40 Mbyte file. The experimental conditions
are the same as in Figure 6.


These three measurements total to 2.55 seconds.
With an error margin of 5%, this accounts for the
2.43 seconds to run the benchmark. We attribute
this error to a small amount of double counting of
cycles between the NFS loopback measurement and
the computation in the client.
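
The arithmetic behind this breakdown can be checked directly from the figures reported above. The variable names are ours; the numbers are those given in the text.

```python
# Per-operation costs of fchown, in microseconds.
loopback_us = 167.0   # through the user-level NFS loopback client
local_us = 2.4        # on a local file, permission denied
rpcs = 4015           # NFS loopback RPCs generated by the benchmark

nfs_loopback = (loopback_us - local_us) * rpcs / 1e6  # seconds
client_cpu = 1.386    # CPU time in the read-only client, seconds
server_comm = 0.507   # replaying the 2101 getdata RPCs, seconds

total = nfs_loopback + client_cpu + server_comm
measured = 2.43       # end-to-end benchmark time, seconds
overshoot = (total - measured) / measured

print(f"NFS loopback: {nfs_loopback:.3f} s")
print(f"Total:        {total:.2f} s")
print(f"Overshoot:    {overshoot:.1%}")
```

The sum exceeds the measured time by roughly 5%, which is the error margin quoted in the text.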


The cryptography accounts for very little of the 
time. The CPU time spent on verification is only 
0.086 seconds. Moreover, end-to-end measurements 
show that data verification has little impact on per- 
formance. RONV performs slightly better than RO 
(2.31 vs. 2.43 seconds). Therefore, any optimization 
will have to focus on the non-cryptographic portions 
of the system. 


6.2.2 Large file benchmark 


Figure 7 shows the performance of sequentially and 
randomly reading a large (40 Mbyte) file containing 
random data. We read in blocks of 8 KBytes. In 
the network experiments, the file is in the server’s 
cache, but not in the client’s cache. Thus, we are not 
measuring the server’s disk. This isolates the software
overhead of cryptography and SFS’s user-level
design. For the local file system, all samples were 
within 1.4% of the average. For NFSv3 over UDP 
and the read-write experiments, all samples were 
within 1% of the average. For NFSv3 over TCP and 
the read-only experiments, all samples were within 
4.3% of the average. This variability and the poor
NFSv3 over TCP performance appear to be due to
a pathology of FreeBSD.


The SFS read-only server performs better than the 
read-write server because the read-only server per- 
forms no on-line cryptographic operations. On the 
sequential workload, verification costs 1.4 Mbyte/s 
in throughput. NFSv3 over TCP performs substan- 
tially better (9.8 vs. 6.5 Mbyte/s) than the read- 
only file system without verification, even though 
both run over TCP and do similar amounts of work; 
the main difference is that NFS is implemented in 
the kernel. 


If the large file contains only blocks of zeros, SFS 
read-only obtains a throughput of 17 Mbyte/s since 
all blocks hash to the same handle. In this case, 
the measurement is dominated by the throughput of 
loop-back NFSv3 over UDP on the client machine. 


6.3 Software distribution 


To evaluate how well the read-only file system per- 
forms on a larger application benchmark, we com- 
piled (with optimization and debugging disabled) 
Emacs 20.6 with a local build directory and a re- 
mote source directory. The results are shown in 
Figure 8. The RO experiment performs 1% worse 
(1 second) than NFSv3 over UDP and 4% better 
(3 seconds) than NFSv3 over TCP. Disabling in- 
tegrity checks in the read-only file system (RONV) 
does not speed up the compile because our caches 
absorb the cost of hash verification. However, dis- 
abling caching does decrease performance (RONC). 
During a single Emacs compilation, the read-only 
server consumes less than 1% of its CPU while the 
read-only client consumes less than 2% of its CPU. 
This demonstrates that the read-only protocol in- 
troduces negligible performance degradation in an 
application benchmark. 


To evaluate how well sfsrosd scales, we took a 
trace of a single client compiling the Emacs 20.6 
source tree, repeatedly played the trace to the server 
from an increasing number of simulated, concurrent 
clients, and plotted the aggregate throughput deliv- 
ered by sfsrosd. The results are shown in Figure 9. 
Each sample represents the throughput of playing 
traces for 100 seconds. Each trace consists of 1428 
RPCs. With 300 simultaneous clients, the server 
consumes 96% of the CPU. 


With more than 300 clients, the FreeBSD server
reboots because of a bug in its TCP implementation.
We replaced the FreeBSD server with an OpenBSD
server and measured that sfsrosd maintains a rate
of 10 Mbyte/s of file system data up to 600 simultaneous
clients.


Figure 8: Compiling the Emacs 20.6 source. Local
is FreeBSD’s local FFS file system on the server.
The local file system was tested with a cold cache.
The network tests were applied to warmed server
caches, but cold client caches. RW, RO, RONV, and
RONC denote respectively the read-write protocol,
the read-only protocol, the read-only protocol with
no verification, and the read-only protocol with no
caching.


6.4 Certificate authority 


To evaluate whether the read-only file system per- 
forms well enough to function as an on-line certifi- 
cate authority, we compare the number of connec- 
tions a single read-only file server can sustain with 
the number of connections to the SFS read-write 
server, the number of SSL connections to an Apache 
web server, and the number of HTTP connections 
to an Apache server. 


The SFS servers use 1024-bit keys. The SFS read-write 
server performs one Rabin-Williams decryption 
per connection while the SFS read-only server 
performs no on-line cryptographic operations. The 
Web server was Apache 1.3.12 with OpenSSL 0.9.5a 
and ModSSL 2.6.3-1.3.12. Our SSL ServerID certificate 
and Verisign CA certificate use 1024-bit RSA 
keys. All the SSL connections use the TLSv1 cipher 
suite consisting of Ephemeral Diffie-Hellman key exchange, 
DES-CBC3 for confidentiality, and SHA-1 
HMAC for integrity. 


Figure 9: The aggregate throughput delivered by the 
read-only server for an increasing number of clients 
simultaneously compiling the Emacs 20.6 source. 
The number of clients is plotted on a log scale. 


To generate enough load to saturate the servers, 
we wrote a simple client program that sets up 
connections, reads a small file containing a self- 
certifying path, and terminates the connection as 
fast as it can. We run this client program simulta- 
neously on two OpenBSD machines. In all experi- 
ments, the certificate is in the main memory of the 
server, so we are limited by software performance, 
not by disk performance. This scenario is realistic 
since we envision that important on-line certificate 
authorities would have large enough memories to 
avoid frequent disk accesses, like DNS second-level 
servers. 


The SFS read-only protocol performs client-side 
name resolution, unlike the Web server, which 
performs server-side name resolution. We measured 
both single-component and multi-component 
lookups. (For instance, http://host/a.html 
causes a single-component lookup while 
http://host/a/b/c/d.html causes a multi-component 
lookup.) The read-only client makes a linear number 
of RPCs with respect to the number of components 
in a lookup. On the other hand, the HTTP 
client makes only one HTTP request regardless of 
the number of components in the URL path. 


Figure 10: Maximum sustained certificate downloads 
per second. HTTP is an insecure Web server, 
SSL is a secure Web server, SFSRW is the secure 
SFS read-write file system, and SFSRO is the secure 
read-only file system. Light bars represent single-component 
lookups while dark bars represent multi-component 
lookups. 


The HTTP and SSL single- and multi-component 
tests consist of a GET /symlink.txt and GET 
/one/two/three/symlink.txt respectively, where 
symlink.txt contains the string 
/sfs/new-york.lcs.mit.edu:bzccShder/cuc86kf6qswyx6yuemnw69/. 
The SFSRO and SFSRW tests consist of 
comparable operations. We play a trace of reading 
a symlink that points to the above self-certifying 
path. The single-component SFSRO trace consists 
of 5 RPCs to read a symlink in the top-level 
directory. The multi-component trace consists of 11 
RPCs to read a symlink in a directory three levels 
deep. The single-component SFSRW trace consists 
of 6 RPCs while the multi-component trace consists 
of 12 RPCs. 


Figure 10 shows that the read-only server scales 
well. For single-component lookups, the SFS read- 
only server can process 26 times more certificate 
downloads than the SFS read-write server because 
the read-only server performs no on-line cryptographic 
operations. The read-write server is bottlenecked 
by public key decryptions, which each take 
24 msec. Hence, the read-write server can at best 
achieve 38 (1000/24) connections per second. By 
comparing the read-write server to the Apache Web 
server with SSL, we see that the read-write server is 
in fact quite efficient; the SSL protocol requires more 
and slower cryptographic operations on the server 
than the SFS read-only protocol. 


By comparing the read-only server with an inse- 
cure Apache server, we can conclude that the read- 
only server is a good platform for serving read-only 
data to many clients; the number of connections per 
second is only 32% lower than that of the insecure 
Apache server. In fact, the performance of SFS read- 
only is within an order of magnitude of the perfor- 
mance of a DNS root server, which according to Net- 
work Solutions can sustain about 4,000 lookups per 
second (DNS uses UDP instead of TCP). Since the 
DNS root servers can support on-line name resolu- 
tion for the Internet, this comparison suggests that 
it is reasonable to build a distributed on-line certifi- 
cate authority using SFS read-only servers. 


A multi-component lookup is faster with HTTP 
than with SFSRO. The SFSRO client must make two 
RPCs per component. Hence, there is a slowdown 
for deep directories. In practice, the impact on per- 
formance will depend on whether clients do multi- 
component lookups once, and then never look at the 
same directory again, or rather, amortize the cost of 
walking the file system over multiple lookups. In any 
situation in which a single read-only client does mul- 
tiple lookups in the same directory, the client should 
have performance similar to the single-component 
case because it will cache the components along the 
path. 
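
The trace sizes reported above fit a simple linear model: a fixed number of RPCs plus two per directory level below the top. The sketch below encodes that model. The function names and the idea of a per-protocol "base" cost are ours, inferred from the four reported trace sizes (5 and 11 RPCs for SFSRO, 6 and 12 for SFSRW) rather than taken from a protocol specification.

```python
RPCS_PER_COMPONENT = 2  # "two RPCs per component", as stated in the text


def sfs_rpcs(depth, base):
    """RPCs needed to read a symlink `depth` directories below the top level.

    `base` is the cost of the single-component case: 5 for the SFSRO
    trace and 6 for the SFSRW trace reported above.
    """
    return base + RPCS_PER_COMPONENT * depth


def http_requests(depth):
    """An HTTP client sends one GET regardless of the path depth."""
    return 1


sfsro_single = sfs_rpcs(0, base=5)
sfsro_multi = sfs_rpcs(3, base=5)   # /one/two/three/symlink.txt
sfsrw_single = sfs_rpcs(0, base=6)
sfsrw_multi = sfs_rpcs(3, base=6)
```

Under this model the gap between SFSRO and HTTP grows by two round-trips per extra directory level, unless the client has already cached the components along the path.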


In the case of our CA benchmark, it is realistic to 
expect all files to reside in the root directory. Thus, 
this usage scenario minimizes people’s true multi- 
component needs. On the other hand, if the root 
directory is huge, then SFS read-only will require 
a logarithmic number of round-trips for a lookup. 
However, SFS read-only will still outperform HTTP 
on a typical file system because Unix typically per- 
forms directory lookups in time linear in the num- 
ber of directory entries; SFS read-only performs a 
lookup in logarithmic time in the number of direc- 
tory entries. 
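
To make the linear-versus-logarithmic comparison concrete, the sketch below counts worst-case work for both approaches. The fan-out of 100 entries per directory node is an assumption chosen only for illustration; the text above states only that the number of round-trips is logarithmic in the directory size.

```python
def linear_scan_probes(n_entries):
    """Worst-case entries examined by a linear directory scan."""
    return n_entries


def tree_round_trips(n_entries, fanout):
    """Levels of a balanced tree holding n_entries with `fanout` entries per node."""
    levels = 1
    capacity = fanout
    while capacity < n_entries:
        capacity *= fanout
        levels += 1
    return levels


ASSUMED_FANOUT = 100  # hypothetical value, not taken from the paper
big_directory = 1_000_000
scan_cost = linear_scan_probes(big_directory)
tree_cost = tree_round_trips(big_directory, ASSUMED_FANOUT)
```

With these assumed numbers, a directory of one million entries costs three round-trips in the tree, whereas a linear scan may examine every entry.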


7 Conclusion 


The SFS read-only file system is a distributed file 
system that allows a high number of clients to se- 
curely access public, read-only data. The data of the 
file system is stored in a database, which is signed 
off-line with the private key of the file system. The 
private key of the file system does not have to be 
on-line, allowing the database to be replicated on many 
untrusted machines. To allow for frequent updates, the 
database can be replicated incrementally. The read-only 
file system pushes the cost of cryptographic 
operations from the server to the clients, allowing 
read-only servers to be simple and to support many 
clients. An implementation of the design in the con- 
text of the SFS global file system confirms that the 
read-only file system can support a large number of 
clients, while providing individual clients with ac- 
ceptable application performance. 
Abstract 


Overcast is an application-level multicasting system 
that can be incrementally deployed using today’s 
Internet infrastructure. These properties stem from 
Overcast’s implementation as an overlay network. 
An overlay network consists of a collection of nodes 
placed at strategic locations in an existing network 
fabric. These nodes implement a network abstrac- 
tion on top of the network provided by the under- 
lying substrate network. 


Overcast provides scalable and reliable single-source 
multicast using a simple protocol for building effi- 
cient data distribution trees that adapt to changing 
network conditions. To support fast joins, Overcast 
implements a new protocol for efficiently tracking 
the global status of a changing distribution tree. 


Results based on simulations confirm that Over- 
cast provides its added functionality while perform- 
ing competitively with IP Multicast. Simulations 
indicate that Overcast quickly builds bandwidth- 
efficient distribution trees that, compared to IP 
Multicast, provide 70%-100% of the total band- 
width possible, at a cost of somewhat less than twice 
the network load. In addition, Overcast adapts 
quickly to changes caused by the addition of new 
nodes or the failure of existing nodes without caus- 
ing undue load on the multicast source. 


1 Introduction 


Overcast is motivated by real-world problems faced 
by content providers using the Internet today. How 
can bandwidth-intensive content be offered on de- 
mand? How can long-running content be offered to 
vast numbers of clients? Neither of these challenges 
is met by today’s infrastructure, though for different 
reasons. Bandwidth-intensive content (such 
as 2Mbit/s video) is impractical because the bot- 
tleneck bandwidth between content providers and 


consumers is considerably less than the natural con- 
sumption rate of such media. With currently avail- 
able bandwidth, a 10-minute news clip might require 
an hour of download time. On the other hand, large- 
scale (thousands of simultaneous viewers) use of 
even moderate-bandwidth live video streams (per- 
haps 128Kbit/s) is precluded because network costs 
scale linearly with the number of consumers. 
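
The two problems above reduce to simple arithmetic, sketched below. The function names are ours, and the bottleneck bandwidth is derived from the "hour of download time" example rather than stated in the text; the 1,000-viewer figure is one instance of "thousands of simultaneous viewers".

```python
def implied_bottleneck_mbit_s(clip_seconds, media_mbit_s, download_seconds):
    """Bottleneck bandwidth implied by a clip length, media rate, and download time."""
    return clip_seconds * media_mbit_s / download_seconds


def unicast_load_mbit_s(viewers, stream_kbit_s):
    """Aggregate source bandwidth when every viewer gets a separate unicast stream."""
    return viewers * stream_kbit_s / 1000


# A 10-minute clip encoded at 2 Mbit/s that takes an hour to download.
bottleneck = implied_bottleneck_mbit_s(10 * 60, 2.0, 60 * 60)

# 1,000 viewers of a 128 Kbit/s live stream, each served separately.
source_load = unicast_load_mbit_s(1000, 128)
```

The first result is about a third of a megabit per second, one sixth of the media rate; the second shows the source load growing in direct proportion to the audience.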


Overcast attempts to address these difficulties by 
combining techniques from a number of other sys- 
tems. Like IP Multicast, Overcast allows data to 
be sent once to many destinations. Data are repli- 
cated at appropriate points in the network to mini- 
mize bandwidth requirements while reaching multi- 
ple destinations. Overcast also draws from work in 
caching and server replication. Overcast’s multicast 
capabilities are used to fill caches and create server 
replicas throughout a network. Finally, Overcast is 
designed as an overlay network, which allows Over- 
cast to be incrementally deployed. As nodes are 
added to an Overcast system the system’s benefits 
are increased, but Overcast need not be deployed 
universally to be effective. 


An Overcast system is an overlay network consist- 
ing of a central source (which may be replicated 
for fault tolerance), any number of internal Over- 
cast nodes (standard PCs with permanent storage) 
sprinkled throughout a network fabric, and stan- 
dard HTTP clients located in the network. Using 
a simple tree-building protocol, Overcast organizes 
the internal nodes into a distribution tree rooted 
at the source. The tree-building protocol adapts 
to changes in the conditions of the underlying net- 
work fabric. Using this distribution tree, Overcast 
provides large-scale, reliable multicast groups, espe- 
cially suited for on-demand and live data delivery. 
Overcast allows unmodified HTTP clients to join 
these multicast groups. 


Overcast permits the archival of content sent to mul- 
ticast groups. Clients may specify a starting point 
when joining an archived group, such as the beginning 
of the content. This feature allows a client to 
“catch up” on live content by tuning back ten min- 
utes into a stream, for instance. In practice, the 
nature of a multicast group will most often deter- 
mine the way it is accessed. A group containing 
stock quotes will likely be accessed live. A group 
containing a software package will likely be accessed 
from start to finish; “live” would have no meaning 
for such a group. Similarly, high-bandwidth content 
cannot be distributed live when the bottleneck 
bandwidth from client to server is too small. Such 
content will always be accessed relative to its start. 


We have implemented Overcast and used it to create 
a data distribution system for businesses. Most cur- 
rent users distribute high quality video that clients 
access on demand. These businesses operate ge- 
ographically distributed offices and need to dis- 
tribute video to their employees. Before using Over- 
cast, they met this need with low resolution Web- 
accessible video or by physically reproducing and 
mailing VHS tapes. Overcast allows these users 
to distribute high-resolution video over the Internet. 
Because high-quality videos are large (approximately 
1 Gbyte for a 30-minute MPEG-2 video), 
it is important that the videos are efficiently dis- 
tributed and available from a node with high band- 
width to the client. To a lesser extent, Overcast is 
also being used to broadcast live streams. Existing 
Overcast networks typically contain tens of nodes 
and are scheduled to grow to hundreds of nodes. 


The main challenge in Overcast is the design and 
implementation of protocols that can build effi- 
cient, adaptive distribution trees without knowing 
the details of the substrate network topology. The 
substrate network’s abstraction provides the ap- 
pearance of direct connectivity between all Over- 
cast nodes. Our goal is to build distribution trees 
that maximize each node’s bandwidth from the 
source and utilize the substrate network topology 
efficiently. For example, the Overcast protocols 
should attempt to avoid sending data multiple times 
over the same physical link. Furthermore, Overcast 
should respond to transient failures or congestion in 
the substrate network. 


Consider the simple network depicted in Figure 1. 
The network substrate consists of a root node (R), 
two Overcast nodes (O), a router, and a number 
of links. The links are labeled with bandwidth in 
Mbit/s. There are three ways of organizing the root 
and the Overcast nodes into a distribution tree. The 
organization shown optimizes bandwidth by using 
the constrained link only once. 


Figure 1: An example network and Overcast topology. The 
straight lines are the links in the substrate network. These 
links are labeled with bandwidth in Mbit/s. The curved lines 
represent connections in the overlay network. S represents 
the source, O represents two Overcast nodes. 
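
The effect of crossing the constrained link only once can be sketched with a toy model. Everything concrete below is an assumption made for illustration: the link bandwidths (a 10 Mbit/s link between the root and the router, 100 Mbit/s links to each Overcast node), the node names O1 and O2, and the rule that flows crossing a link in the same direction share it evenly.

```python
from collections import Counter

# Assumed substrate (Mbit/s): the root R sits behind a slow link,
# and both Overcast nodes hang off the same router on fast links.
LINK_BW = {
    frozenset(("R", "router")): 10,
    frozenset(("router", "O1")): 100,
    frozenset(("router", "O2")): 100,
}


def physical_path(src, dst):
    """Directed hops from src to dst; every host attaches to the one router."""
    return [(src, "router"), ("router", dst)]


def node_bandwidths(parent):
    """Bandwidth from R to each node, given a map of node -> overlay parent."""
    load = Counter()
    for child, par in parent.items():
        load.update(physical_path(par, child))

    result = {}

    def bandwidth(node):
        if node == "R":
            return float("inf")
        if node not in result:
            par = parent[node]
            # Each directed link is shared evenly by the flows crossing it.
            hop_bw = min(LINK_BW[frozenset(hop)] / load[hop]
                         for hop in physical_path(par, node))
            result[node] = min(bandwidth(par), hop_bw)
        return result[node]

    for node in parent:
        bandwidth(node)
    return result


star = node_bandwidths({"O1": "R", "O2": "R"})    # both nodes fetch from R
chain = node_bandwidths({"O1": "R", "O2": "O1"})  # O2 fetches from O1
```

Under these assumptions the star tree sends the data over the slow link twice, so each node sees half of its bandwidth, while the chain crosses it once and both nodes see the full 10 Mbit/s.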


The contributions of this paper are: 


• A novel use of overlay networks. We describe 
how reliable, highly scalable, application-level 
multicast can be provided by adding nodes that 
have permanent storage to the existing network 
fabric. 


• A simple protocol for forming efficient and scalable 
distribution trees that adapt to changes in 
the conditions of the substrate network without 
requiring router support. 


• A novel protocol for maintaining global status 
at the root of a changing distribution tree. This 
state allows clients to join an Overcast group 
quickly while maintaining scalability. 


• Results from simulations that show Overcast is 
efficient. Overcast can scale to a large number 
of nodes; its efficiency approaches router-based 
systems; it quickly adjusts to configuration 
changes; and a root can track the status of 
an Overcast network in a scalable manner. 


Section 2 details Overcast’s relation to prior work. 
Overcast’s general structure is examined in Section 
3, first by describing overlay networks in general, 
then providing the details of Overcast. Section 
4 describes the operation of the Overcast network 
performing reliable application-level multicast. Fi- 
nally, Section 5 examines Overcast’s ability to build 
a bandwidth-efficient overlay network for multicas- 
ting and to adapt efficiently to changing network 
conditions. 
2 Related Work 


Overcast seeks to marry the bandwidth savings of 
an IP Multicast distribution tree with the reliability 
and simplicity of store-and-forward operation using 
reliable communication between nodes. Overcast 
builds on research in IP multicast, content distri- 
bution (caching, replication, and content routing), 
and overlay networks. We discuss each in turn. 


IP Multicast IP Multicast [11] is designed to pro- 
vide efficient group communication as a low level 
network primitive. Overcast has a number of ad- 
vantages over IP Multicast. First, as it requires no 
router support, it can be deployed incrementally on 
existing networks. Second, Overcast provides band- 
width savings both when multiple clients view con- 
tent simultaneously and when multiple clients view 
content at different times. Third, while reliable mul- 
ticast is the subject of much research [19, 20], prob- 
lems remain when various links in the distribution 
tree have widely different bandwidths. A common 
strategy in such situations is to decrease the fidelity 
of content over lower bandwidth links. Although 
such a strategy has merit when content must be de- 
livered live, Overcast also supports content types 
that require bit-for-bit integrity, such as software. 


Express [15] is a single-source multicasting system 
that addresses some of IP Multicast’s deficits. Ex- 
press alleviates difficulties relating to IP Multicast’s 
small address space, susceptibility to denial of ser- 
vice attacks, and billing difficulties which may lie 
at the root of IP Multicast’s lack of deployment 
on commercial networks. In these three respects 
Overcast bears a great deal of similarity to Ex- 
press. Overcast differs mainly by stressing deploy- 
ability and flexibility. Overcast does not require 
router modifications, simplifying adoption and in- 
creasing flexibility. Although Overcast provides a 
useful range of functionality, we recognize that there 
are needs for which Overcast may not be suited. Express 
standardizes a single model in the router which 
works to lock out applications with different needs. 


Content Distribution Systems Others have ad- 
vocated distributing content servers in the net- 
work fabric, from initial proposals [10] to larger 
projects, such as Adaptive Caching [26], Push 
Caching [14], Harvest [8], Dynamic Hierarchical 
Caching [7], Speculative Data Dissemination [6], 
and Application-Level Replication [4]. Overcast ex- 
tends this previous work by building an overlay net- 
work using a self-organizing algorithm. This algo- 
rithm, operating continuously, not only eliminates 


the need for manually determined topology infor- 
mation when the overlay network is created, but 
also reacts transparently to the addition or removal 
of nodes in the running system. Initialization, ex- 
pansion, and fault tolerance are unified. 


A number of service providers (e.g., Adero, Aka- 
mai, and Digital Island) operate content distribu- 
tion networks, but in-depth information describing 
their internals is not public information. FastFor- 
ward’s product is described below as an example of 
an overlay network. 


Overlay Networks A number of research groups 
and service providers are investigating services 
based on overlay networks. In particular, many of 
these services, like Overcast, exist to provide some 
form of multicast or content distribution. These in- 
clude End System Multicast [16], Yoid [13] (formerly 
Yallcast), X-bone [24], RMX [9], FastForward [1], 
and PRISM [5]. All share the goal of providing 
the benefits of IP multicast without requiring di- 
rect router support or the presence of a physical 
broadcast medium. However, except Yoid, these ap- 
proaches do not exploit the presence of permanent 
storage in the network fabric. 


End System Multicast is an overlay network that 
provides small-scale multicast groups for telecon- 
ferencing applications; as a result the End System 
Multicast protocol (Narada) is designed for multi-source 
multicast. The Overcast protocols differ 
from Narada in order to support large-scale multicast 
groups. 


Yoid is a generic architecture for overlay networks 
with a number of new protocols, which are in devel- 
opment. The most striking difference between Yoid 
and Overcast is in approach. Yoid strives to be a 
general purpose overlay network and content distri- 
bution toolkit, addressing applications as diverse as 
netnews, streaming broadcasts, and bulk email dis- 
tribution. While these goals are laudable, we believe 
that because Overcast is more focused on providing 
single-source multicast our protocols are simpler to 
understand and implement. Nonetheless, there re- 
mains a great deal of similarity between Overcast 
and Yoid, including URL-like group naming, the use 
of disk space to “time-shift” multicast distribution, 
and automatic tree configuration. 


X-bone is also a general-purpose overlay network 
that can support many different network services. 
The overlay networks formed by X-bone are meshes, 
which are statically configured. 
RMX focuses on real-time reliable multicast. As 
such, its focus is on reconciling the heterogeneous capabilities 
and network connections of various clients 
with the need for reliability. Therefore their work 
focuses on semantic rather than data reliability. For 
instance, RMX can be used to change high resolu- 
tion images into progressive JPEGs before trans- 
mittal to underprovisioned clients. Our work is less 
concerned with interactive response times. Overcast 
is designed for content that clients are interested in 
only at full fidelity, even if it means that the content 
does not become available to all clients at the same 
time. 


FastForward Networks produces a system sharing 
many properties with RMX. Like RMX, FastFor- 
ward focuses on real-time operation and includes 
provisions for intelligently decreasing the band- 
width requirements of rich media for low-bandwidth 
clients. Beyond this, FastForward’s product differs 
from Overcast in that its distribution topology is 
statically configured by design. Within this stati- 
cally configured topology, the product can pick dy- 
namic routes. In this way FastForward allows ex- 
perts to configure the topology for better perfor- 
mance and predictability while allowing for a lim- 
ited degree of dynamism. Overcast’s design seeks 
to minimize human intervention to allow its overlay 
networks to scale to thousands of nodes. Similarly, 
FastForward achieves fault tolerance by statically 
configuring distribution topologies to avoid single 
points of failure, while Overcast seeks to dynami- 
cally reconfigure its overlay in response to failures. 


PRISM is an architecture for distributing streaming 
media over IP. Its architecture bears some similarity 
to Overcast, but their work appears focused on the 
naming of content and the design of interior nodes of 
the system. PRISM’s high level design includes an 
overlay based content distribution mechanism, but 
it is assumed that such a system can be “plugged 
in” to the rest of PRISM. Overcast could provide 
that mechanism. 


Active Services Active Services [2] is a frame- 
work for implementing services at the application- 
level throughout the fabric of the network. In that 
sense, there is a strong similarity in mindset between 
our works. However, Active Services must contend 
with the difficulty of sharing the resources of a sin- 
gle computer among multiple services, a difficulty 
we avoid by using dedicated nodes. Perhaps be- 
cause of this challenge, Active Service applications 
have focused on real-time multimedia streaming, an 
application with transient resource needs. Our application 
uses large amounts of disk space for long 
periods of time, which is problematic in a shared environment. 


Our observation is that one-time hardware costs do 
not drive the total costs of systems on the scale 
that we propose. Total cost is dominated by band- 
width, maintenance, and continual hardware obso- 
lescence. Therefore Overcast seeks to minimize the 
use of bandwidth, cut maintenance costs by sim- 
plifying node deployment, and avoid obsolescence 
by structuring the system to allow older nodes to 
continue to contribute to the total efficiency of the 
overlay network. 


Active Networks One may view overlay networks 
as an alternative implementation of active net- 
works [23]. In active networks, new protocols and 
application-code can dynamically be downloaded 
into routers, allowing for rapid innovation of net- 
work services. Overcast avoids some of the hard 
problems of active networks by focusing on a single 
application; it does not have to address the prob- 
lems created by dynamic downloading of code and 
sharing resources among multiple competing appli- 
cations. Furthermore, since Overcast requires no 
changes to existing routers, it is easier to deploy. 
The main challenge for Overcast is to be competi- 
tive with solutions that are directly implemented on 
the network level. 


3 The Overcast Network 


This section describes the overlay network created 
by the Overcast system. First, we argue the ben- 
efits and drawbacks of using an overlay network. 
After concluding that an overlay network is appro- 
priate for the task at hand, we explore the particular 
design of an overlay network to meet Overcast’s demands. 
To do so, we examine the key design requirement 
of the Overcast network: single-source distribution 
of bandwidth-intensive media on today’s Internet 
infrastructure. Finally, we illustrate the use 
of Overcast with an example. 


3.1 Why overlay? 


Overcast was designed to meet the needs of con- 
tent providers on the Internet. This goal led us to 
an overlay network design. To understand why we 
chose an overlay network, we consider the benefits 
and drawbacks of overlays. 
An overlay network provides advantages over both 
centrally located solutions and systems that advo- 
cate running code in every router. An overlay net- 
work is: 


Incrementally Deployable An overlay network 
requires no changes to the existing Internet infras- 
tructure, only additional servers. As nodes are 
added to an overlay network, it becomes possible to 
control the paths of data in the substrate network 
with ever greater precision. 


Adaptable Although an overlay network abstrac- 
tion constrains packets to flow over a constrained 
set of links, that set of links is constantly being 
optimized over metrics that matter to the applica- 
tion. For instance, the overlay nodes may opti- 
mize latency at the expense of bandwidth. The De- 
tour Project [21] has discovered that there are often 
routes between two nodes with less latency than the 
routes offered by today’s IP infrastructure. Overlay 
networks can find and take advantage of such routes. 


Robust By virtue of the increased control and the 
adaptable nature of overlay networks, an overlay 
network can be more robust than the substrate fab- 
ric. For instance, with a sufficient number of nodes 
deployed, an overlay network may be able to guar- 
antee that it is able to route between any two nodes 
in two independent ways. While a robust substrate 
network can be expected to repair faults eventu- 
ally, such an overlay network might be able to route 
around faults immediately. 


Customizable Overlay nodes may be multi- 
purpose computers, easily outfitted with whatever 
equipment makes sense. For example, Overcast 
makes extensive use of disk space. This allows 
Overcast to provide bandwidth savings even when 
content is not consumed simultaneously in different 
parts of the network. 


Standard An overlay network can be built on the 
least common denominator network services of the 
substrate network. This ensures that overlay traffic 
will be treated as well as any other. For example, 
Overcast uses TCP (in particular, HTTP over port 
80) for reliable transport. TCP is simple, well understood, 
network friendly, and standard. Alternatives, 
such as a “home grown” UDP protocol with 
retransmissions, are less attractive by all these mea- 
sures. For better or for worse, creativity in reliable 
transport is a losing battle on the Internet today. 


On the other hand, building an overlay network 
faces a number of interesting challenges. An overlay 
network must address: 
Management complexity The manager of an 
overlay network is physically far removed from the 
machines being managed. Routine maintenance 
must either be unnecessary or possible from afar, 
using tools that do not scale in complexity with the 
size of the network. Physical maintenance must be 
minimized and be possible by untrained personnel. 


The real world In the real world, IP does not 
provide universal connectivity. A large portion of 
the Internet lies behind firewalls. A significant and 
growing share of hosts are behind Network Address 
Translators (NATs) and proxies. Dealing with 
these practical issues is tedious, but crucial to adop- 
tion. 


Inefficiency An overlay cannot be as efficient as 
code running in every router. However, our observation 
is that when an overlay network is small, the inefficiency, 
measured in absolute terms, will be small 
as well, and as the overlay network grows, its efficiency 
can approach the efficiency of router-based 
services. 


Information loss Because the overlay network is 
built on top of a network infrastructure (IP) that 
offers nearly complete connectivity (limited only by 
firewalls, NATs, and proxies), we expend consider- 
able effort deducing the topology of the substrate 
network. 


The first two of these problems can be addressed 
and nearly eliminated by careful design. To ad- 
dress management complexity, management of the 
entire overlay network can be concentrated at a sin- 
gle site. The key to a centralized-administration 
design is guaranteeing that newly installed nodes 
can boot and obtain network connectivity without 
intervention. Once that is accomplished, further in- 
structions may be read from the central manage- 
ment server. 


Firewalls, NATs and HTTP proxies complicate 
Overcast’s operation in a number of ways. Fire- 
walls force Overcast to open all connections “up- 
stream” and to communicate using HTTP on port 
80. This allows an Overcast network to extend ex- 
actly to those portions of the Internet that allow 
web browsing. NATs are devices used to multiplex 
a small set of IP addresses (often exactly one) over a 
number of clients. The clients are configured to use 
the NAT as their default router. At the NAT, TCP 
connections are rewritten to use one of the small 
number of IP addresses managed by the NAT. TCP 
port numbers allow the NAT to demultiplex return 
packets back to the correct client. The complication 
for Overcast is that client IP addresses are obscured. 
All Overcast nodes behind the NAT appear to have 
the same IP address. HTTP proxies have the same 
effect. 


Although private IP addresses are never directly 
used by external Overcast nodes, there are times 
when an external node must correctly report the 
private IP address of another node. For example, 
an external node may have internal children. During 
tree building a node must report its children’s 
addresses so that they may be measured for suitabil- 
ity as parents themselves. Only the private address 
is suitable for such purposes. To alleviate this com- 
plication all Overcast messages contain the sender’s 
IP address in the payload of the message. 
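
The idea of carrying the sender's address inside the message can be sketched as follows. The class and field names, the addresses, and the send_through_nat helper are all hypothetical; the text above specifies only that each message's payload contains the sender's IP address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OvercastMessage:
    sender_ip: str  # address the sender uses for itself, carried in the payload
    body: str


def send_through_nat(msg, nat_public_ip):
    """Model a NAT: the receiver sees the NAT's address as the transport source."""
    return nat_public_ip, msg


NAT_PUBLIC_IP = "203.0.113.7"  # documentation-range address, assumed

# Two Overcast nodes behind the same NAT, each with its own private address.
msg_a = OvercastMessage(sender_ip="10.0.0.5", body="status")
msg_b = OvercastMessage(sender_ip="10.0.0.6", body="status")

seen_a, got_a = send_through_nat(msg_a, NAT_PUBLIC_IP)
seen_b, got_b = send_through_nat(msg_b, NAT_PUBLIC_IP)
```

The transport-level source is identical for both messages, so only the payload field lets an external node tell the two senders apart and report their private addresses to others.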


The final two disadvantages are not so easily dis- 
missed. They represent the true tradeoff between 
overlay networks and ubiquitous router based soft- 
ware. For Overcast, the goal of instant deployment 
is important enough to sacrifice some measure of 
efficiency. However, the amount of inefficiency introduced 
is a key metric by which Overcast should 
be judged. 


3.2 Single-Source Multicast 


Overcast is a single-source multicast system. This 
contrasts with IP Multicast which allows any mem- 
ber of a multicast group to send packets to all 
other members of the group. Beyond the fact that 
this closely models our intended application domain, 
there are a number of reasons to pursue this partic- 
ular refinement to the IP Multicast model. 


Simplicity Both conceptually and in implementa- 
tion, a single-source system is simpler than an any- 
source model. For example, a single-source provides 
an obvious rendezvous point for group joins. 


Optimization It is difficult to optimize the struc- 
ture of the overlay network without intimate knowl- 
edge of the substrate network topology. This only 
becomes harder if the structure must be optimized 
for all paths [16]. 


Address space Single-source multicast groups pro- 
vide a convenient alternative to the limited IP Mul- 
ticast address space. The namespace can be par- 
titioned by first naming the source, then allowing 
further subdivision of the source’s choosing. In con- 
trast, IP Multicast’s address space is flat, limited, 
and without obvious administration to avoid collisions 
amongst new groups. 


On the other hand, a single-source model clearly of- 
fers reduced functionality compared to a model that 
allows any group member to multicast. As such, 
Overcast is not appropriate for applications that re- 
quire extensive use of such a model. However, many 
applications which appear to need multi-source mul- 
ticast, such as a distributed lecture allowing ques- 
tions from the class, do not. In such an application, 
only one “non-root” sender is active at any particu- 
lar time. It would be a simple matter for the sender 
to unicast to the root, which would then perform the 
true multicast on the behalf of the sender. A num- 
ber of projects [15, 17, 22] have used or advocated 
such an approach. 


3.3 Bandwidth Optimization 


Overcast is designed for distribution from a single 
source. As such, small latencies are expected to be 
of less importance to its users than increased band- 
width. Extremely low latencies are only important 
for applications that are inherently two-way, such 
as video conferencing. Overcast is designed with 
the assumption that broadcasting “live” video on 
the Internet may actually mean broadcasting with 
a ten to fifteen second delay. 


Overcast distribution trees are built with the sole 
goal of creating high bandwidth channels from the 
source to all nodes. Although Overcast makes no 
guarantees that the topologies created are optimal, 
our simulations show that they perform quite well. 
The exact method by which high-bandwidth distri- 
bution trees are created and maintained is described 
in Section 4.2. 


3.4 Deployment 


An important goal for Overcast is to be deployable 
on today’s Internet infrastructure. This motivates 
not only the use of an overlay network, but many 
of its details. In particular, deployment must re- 
quire little or no human intervention, costs per node 
should be minimized, and unmodified HTTP clients 
must be able to join multicast groups in the Over- 
cast network. 


To help ease the human costs of deployment, nodes 
in the Overcast network configure themselves in an 
adaptive distributed tree with a single root. No human 
intervention is required to build efficient distribution 
trees, and nodes can be a part of multiple 
distribution trees. 


Overcast’s implementation on commodity PCs run- 
ning Linux further eases deployment. Development 
is speeded by the familiar programming environ- 
ment, and hardware costs are minimized by con- 
tinually tracking the best price/performance ratio 
available in off-the-shelf hardware. The exact hard- 
ware configuration we have deployed has changed 
many times in the year or so that we have deployed 
Overcast nodes. 


The final consumers of content from an Overcast 
network are HTTP clients. The Overcast proto- 
cols are carefully designed so that unmodified Web 
browsers can become members of a multicast group. 
In Overcast, a multicast group is represented as an 
HTTP URL: the hostname portion names the root 
of an Overcast network and the path represents a 
particular group on the network. All groups with 
the same root share a single distribution tree. 


Using URLs as a namespace for Overcast groups 
has three advantages. First, URLs offer a hierarchical 
namespace, addressing the scarcity of multicast 
group names in traditional IP Multicast. Second, 
URLs and the means to access them are an 
existing standard. By delivering data over a simple 
HTTP connection, Overcast is able to bring multicasting 
to unmodified applications. Third, a URL’s 
richer structure allows for simple expression of the 
increased power of Overcast over traditional multicast. 
For example, a group suffix of start=10s may 
be defined to mean “begin the content stream 10 
seconds from the beginning.” 
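
As an illustration of this naming scheme, the following minimal sketch splits a group URL into the root hostname, the group path, and any suffix options. The host `studio.example.com`, the function name `parse_group_url`, and the choice to carry the `start=10s` suffix in the query string are assumptions made for the example; the text does not specify how a suffix is attached to the URL.

```python
from urllib.parse import urlsplit, parse_qs


def parse_group_url(url):
    """Split a group URL into (root host, group path, suffix options)."""
    parts = urlsplit(url)
    # The hostname portion names the root of an Overcast network.
    root = parts.hostname
    # The path names a particular group on that network.
    group = parts.path
    # Assumption: suffixes such as start=10s travel in the query string.
    options = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return root, group, options


root, group, options = parse_group_url(
    "http://studio.example.com/lectures/cs101?start=10s"
)
```

Under these assumptions, two groups with the same hostname would share a root, and therefore a distribution tree, while differing only in the path.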


3.5 Example usage 


We have used Overcast to build a content-distribution 
application for high-quality video and 
live streams. The application is built out of a pub- 
lishing station (called a studio) and nodes (called 
appliances). Appliances are installed at strategic 
locations in their network. The appliances boot, 
contact their studio, and self-organize into a distri- 
bution tree, as described below. No local adminis- 
tration is required. 


The studio stores content and schedules it for deliv- 
ery to the appliances. Typically, once the content 
is delivered, the publisher at the studio generates 
a web page announcing the availability of the content. 
When a user clicks on the URL for published 
content, Overcast redirects the request to a nearby 
appliance and the appliance serves the content. If 
the content is video, no special streaming software 
is needed. The user can watch the video over stan- 
dard protocols and a standard MPEG player, which 
is supplied with most browsers. 


An administrator at the studio can control the over- 
lay network from a central point. She can view the 
status of the network (e.g., which appliances are 
up), collect statistics, control bandwidth consump- 
tion, etc. 


Using this system, bulk data can be distributed effi- 
ciently, even if the network between the appliances 
and the studio consists of low-bandwidth or inter- 
mittent links. Given the relative prices of disk space 
and network bandwidth, this solution is far less ex- 
pensive than upgrading all network links between 
the studio and every client. 


4 Protocols 


The previous section described the structure and 
properties of the Overcast overlay network. This 
section describes how it functions: the initializa- 
tion of individual nodes, the construction of the 
distribution hierarchy, and the automatic mainte- 
nance of the network. In particular, we describe 
the “tree” protocol to build distribution trees and 
the “up/down” protocol to maintain the global state 
of the Overcast network efficiently. We close by de- 
scribing how clients (web browsers) join a group and 
how reliable multicasting to clients is performed. 


4.1 Initialization 


When a node is first plugged in or moved to a new 
location it automatically initializes itself and con- 
tacts the appropriate Overcast root(s). The first 
step in the initialization process is to determine an 
IP address and gateway address that the node can 
use for general IP connectivity. If there is a local 
DHCP server, the node can obtain IP configuration 
data directly using the DHCP protocol [12]. 
If DHCP is unavailable, a utility program can be 
used from a nearby workstation for manual config- 
uration. 


Once the node has an IP configuration it contacts a 
global, well-known registry, sending along its unique 
serial number. Based on a node’s serial number, the 
registry provides a list of the Overcast networks the 
node should join, an optional permanent IP config- 
uration, the network areas it should serve, and the 
access controls it should implement. If a node is 
intended to become part of a particular content dis- 
tribution network, the configuration data returned 
will be highly specific. Otherwise, default values 
will be returned and the networks that a node 
joins can be controlled using a web-based GUI. 


4.2 The Tree Building Protocol 


Self-organization of appliances into an efficient, ro- 
bust distribution tree is the key to efficient opera- 
tion in Overcast. Once a node initializes, it begins a 
process of self-organization with other nodes of the 
same Overcast network. The nodes cooperatively 
build an overlay network in the form of a distri- 
bution tree with the root node at its source. This 
section describes the tree-building protocol. 


As described earlier, the virtual links of the overlay 
network are the only paths on which data is ex- 
changed. Therefore the choice of distribution tree 
can have a significant impact on the aggregate com- 
munication behavior of the overlay network. By 
carefully building a distribution tree, the network 
utilization of content distribution can be signifi- 
cantly reduced. Overcast stresses bandwidth over 
other conceivable metrics, such as latency, because 
of its expected applications. Overcast is not intended 
for interactive applications; optimizing 
a path to shave small latencies at the expense 
of total throughput would therefore be a mistake. On 
the other hand, Overcast’s architecture as an over- 
lay network allows this decision to be revisited. For 
instance, it may be decided that trees should have 
a fixed maximum depth to limit buffering delays. 


The goal of Overcast’s tree algorithm is to maximize 
bandwidth to the root for all nodes. At a 
high level the algorithm proceeds by placing a new 
node as far away from the root as possible with- 
out sacrificing bandwidth to the root. This ap- 
proach leads to “deep” distribution trees in which 
the nodes nonetheless observe no worse bandwidth 
than obtaining the content directly from the root. 
By choosing a parent that is nearby in the network, 
the distribution tree will form along the lines of the 
substrate network topology. 


The tree protocol begins when a newly initialized 
node contacts the root of an Overcast group. The 
root thereby becomes the current node. Next, the 
new node begins a series of rounds in which it will 
attempt to locate itself further away from the root 
without sacrificing bandwidth back to the root. In 
each round the new node considers its bandwidth 
to current as well as the bandwidth to current 
through each of current’s children. If the band- 
width through any of the children is about as high 
as the direct bandwidth to current, then one of 
these children becomes current and a new round 
commences. In the case of multiple suitable chil- 
dren, the child closest (in terms of network hops) to 
the searching node is chosen. If no child is suitable, 
the search for a parent ends with current. 


To approximate the bandwidth that will be ob- 
served when moving data, the tree protocol mea- 
sures the download time of 10 Kbytes. This mea- 
surement includes all the costs of serving actual 
content. We have observed that this approach to 
measuring bandwidth gives us better results than 
approaches based on low-level bandwidth measure- 
ments such as using ping. On the other hand, we 
recognize that a 10 Kbyte message is too short to 
accurately reflect the bandwidth of “long fat pipes.” 
We plan to move to a technique that uses progres- 
sively larger measurements until a steady state is 
observed. 
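
The measurement above amounts to dividing the probe size by the observed download time. A minimal sketch follows; the function name and the reading of “10 Kbytes” as 10 × 1024 bytes are assumptions made for illustration, not details given in the text.

```python
# Assumption: "10 Kbytes" is taken to mean 10 * 1024 bytes.
PROBE_BYTES = 10 * 1024


def estimate_bandwidth_bps(download_seconds, probe_bytes=PROBE_BYTES):
    """Estimate bandwidth in bits per second from a timed download."""
    if download_seconds <= 0:
        raise ValueError("download time must be positive")
    return probe_bytes * 8 / download_seconds


# A probe that takes half a second suggests roughly 164 kbit/s.
estimate = estimate_bandwidth_bps(0.5)
```

Because the timer covers the whole transfer, the estimate includes connection setup and server costs, which is the property the text values; it is also why a short probe understates the capacity of high-latency, high-bandwidth paths.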


When the measured bandwidths to two nodes are 
within 10% of each other, we consider the nodes 
equally good and select the node that is closest, as 
reported by traceroute. This avoids frequent topol- 
ogy changes between two nearly equal paths, as well 
as decreasing the total number of network links used 
by the system. 
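
The search for a parent described above can be sketched as a loop over rounds. This is a simplified illustration, not the deployed implementation: the dictionaries stand in for live measurements, all names are hypothetical, and “about as high” is interpreted here as “no more than 10% below the direct bandwidth,” which is an assumption based on the 10% rule.

```python
TOLERANCE = 0.10  # assumption: "about as high" means within 10%


def choose_parent(root, children, direct_bw, via_bw, hops):
    """Descend from the root while some child offers about as much bandwidth.

    children:  node -> list of that node's children
    direct_bw: node -> measured bandwidth from the new node directly to it
    via_bw:    (child, node) -> measured bandwidth to node through child
    hops:      child -> network hops from the new node (e.g. from traceroute)
    """
    current = root
    while True:
        threshold = direct_bw[current] * (1.0 - TOLERANCE)
        suitable = [
            child
            for child in children.get(current, [])
            if via_bw[(child, current)] >= threshold
        ]
        if not suitable:
            # No child is suitable: the search ends with current.
            return current
        # Among suitable children, prefer the one closest in network hops.
        current = min(suitable, key=lambda child: hops[child])


# Toy measurements for a three-level tree (values are made up).
children = {"root": ["a", "b"], "a": ["c"]}
direct_bw = {"root": 10.0, "a": 40.0}
via_bw = {("a", "root"): 9.5, ("b", "root"): 9.2, ("c", "a"): 12.0}
hops = {"a": 3, "b": 5, "c": 2}

parent = choose_parent("root", children, direct_bw, via_bw, hops)
```

In this toy run both `a` and `b` are suitable in the first round and `a` wins on hop count; in the second round `c` offers far less than the direct path to `a`, so the search stops at `a`.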


A node periodically reevaluates its position in the 
tree by measuring the bandwidth to its current sib- 
lings (an up-to-date list is obtained from the par- 
ent), parent, and grandparent. Just as in the initial 
building phase, a node will relocate below its sib- 
lings if that does not decrease its bandwidth back 
to the root. The node checks bandwidth directly 
to the grandparent as a way of testing its previous 
decision to locate under its current parent. If nec- 
essary the node moves back up in the hierarchy to 
become a sibling of its parent. As a result, nodes 
constantly reevaluate their position in the tree and 
an Overcast network is inherently tolerant of non- 
root node failures. If a node goes off-line for some 
reason, any nodes that were below it in the tree 
will reconnect themselves to the rest of the rout- 
ing hierarchy. When a node detects that its parent 
is unreachable, it will simply relocate beneath its 
grandparent. If its grandparent is also unreachable 
the node will continue to move up its ancestry until 
it finds a live node. The ancestor list also allows cycles 
to be avoided as nodes asynchronously choose 
new parents. A node simply refuses to become the 
parent of a node it believes to be its own ancestor. 
A node that chooses such a node will be forced to 
choose again. 
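
The two uses of the ancestor list, recovery from a failed parent and cycle avoidance, can be sketched as follows. The function names and the list representation (parent first, root last) are assumptions made for the example.

```python
def relocate_after_parent_failure(ancestors, is_alive):
    """Return the nearest live ancestor above a failed parent, or None.

    ancestors is ordered [parent, grandparent, ..., root].
    """
    # The parent (index 0) is known to be unreachable, so start above it.
    for candidate in ancestors[1:]:
        if is_alive(candidate):
            return candidate
    return None


def accepts_child(own_ancestors, would_be_child):
    """A node refuses to adopt a node it believes is its own ancestor."""
    return would_be_child not in own_ancestors


ancestors = ["parent", "grandparent", "great-grandparent", "root"]
alive = {"great-grandparent", "root"}
new_parent = relocate_after_parent_failure(ancestors, lambda n: n in alive)
```

Here both the parent and grandparent are down, so the node settles under its great-grandparent. A node whose request is refused by `accepts_child` would simply run its parent search again.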


While there is extensive literature on faster fail-over 
algorithms, we have not yet found a need to opti- 
mize beyond the strategy outlined above. It is im- 
portant to remember that the nodes participating 
in this protocol are dedicated machines that are less 
prone to failure than desktop computers. If this be- 
comes an issue, we have considered extending the 
tree building algorithm to maintain backup parents 
(excluding a node’s own ancestry from considera- 
tion) or an entire backup tree. 


By periodically remeasuring network performance, 
the overlay network can adapt to network conditions 
that manifest themselves at time scales longer 
than the interval at which the distribution tree 
reorganizes. For example, a tree that is optimized 
for bandwidth-efficient content delivery during the 
day may be significantly suboptimal during the 
overnight hours (when network congestion is typ- 
ically lower). The ability of the tree protocol to 
automatically adapt to these kinds of changing net- 
work conditions provides an important advantage 
over simpler, statically configured content distribu- 
tion schemes. 


4.3 The Up/Down Protocol 


To allow web clients to join a group quickly, the 
Overcast network must track the status of the Over- 
cast nodes. It may also be important to report sta- 
tistical information back to the root, so that content 
providers might learn, for instance, how often cer- 
tain content is being viewed. This section describes 
a protocol for efficient exchange of information in 
a tree of network nodes to provide the root of the 
tree with information from nodes throughout the 
network. For our needs, this protocol must scale 
sublinearly in terms of network usage at the root, 
but may scale linearly in terms of space (all with 
respect to the number of Overcast nodes). This 
is a simple result of the relative requirements of a 
client for these two resources and the cost of those 
resources. Overcast might store (conservatively) a 
few hundred bytes about each Overcast node, but 
even in a group of millions of nodes, total RAM cost 
for the root would be under $1,000. 


We call this protocol the “up/down” protocol be- 
cause our current system uses it mainly to keep track 
of what nodes are up and what nodes are down. 
However, arbitrary information in either of two large 
classes may be propagated to the root. In particu- 
lar, if the information either changes slowly (e.g., 
up/down status of nodes), or the information can 
be combined efficiently from multiple children into a 
single description (e.g., group membership counts), 
it can be propagated to the root. Rapidly changing 
information that cannot be aggregated during 
propagation would overwhelm the root’s bandwidth 
capacity. 


Each node in the network, including the root node, 
maintains a table of information about all nodes 
lower than itself in the hierarchy and a log of all 
changes to the table. Therefore the root node’s ta- 
ble contains up-to-date information for all nodes in 
the hierarchy. The table is stored on disk and cached 
in the memory of a node. 


The basis of the protocol is that each node period- 
ically checks in with the node directly above it in 
the tree. If a child fails to contact its parent within 
a preset interval, the parent will assume the child 
and all its descendants have “died”. That is, either 
the node has failed, an intervening link has failed, or 
the child has simply changed parents. In any case, 
the parent node marks the child and its descendants 
“dead” in its table. Parents never initiate contact 
with descendants. This is a byproduct of a design 
that is intended to cross firewalls easily. All node 
failures must be detected by a failure to check in, 
rather than active probing. 
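
The passive failure detection described above can be sketched as a table of check-in times that the parent sweeps periodically. The class and method names are hypothetical, and times are plain numbers (seconds) supplied by the caller so that the example stays deterministic.

```python
class ChildTable:
    """Hypothetical per-parent bookkeeping for child check-ins."""

    def __init__(self, interval):
        self.interval = interval  # preset check-in interval, in seconds
        self.last_seen = {}       # child -> time of its last check-in
        self.descendants = {}     # child -> set of nodes below that child
        self.dead = set()         # nodes currently marked "dead"

    def check_in(self, child, now, descendants=()):
        """Record a check-in; the child always initiates contact."""
        self.last_seen[child] = now
        self.descendants[child] = set(descendants)
        self.dead -= {child} | self.descendants[child]

    def expire(self, now):
        """Mark children that missed their check-in, and their descendants."""
        newly_dead = set()
        for child, seen in self.last_seen.items():
            if child not in self.dead and now - seen > self.interval:
                newly_dead |= {child} | self.descendants[child]
        self.dead |= newly_dead
        return newly_dead


table = ChildTable(interval=30)
table.check_in("a", now=0, descendants=["a1", "a2"])
table.check_in("b", now=20)
expired = table.expire(now=40)
```

Note that the parent cannot distinguish a crashed child from one that has changed parents or lost its link; all three cases look like a missed check-in.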


During these periodic check-ins, a node reports new 
information that it has observed or been informed 
of since it last checked in. This includes: 


• “Death certificates” - Children that have 
missed their expected report time. 


• “Birth certificates” - Nodes that have become 
children of the reporting node. 


• Changes to the reporting node’s “extra information.” 


• Certificates or changes that have been propagated 
to the node from its own children since 
its last check-in. 


This simple protocol exhibits a race condition when 
a node chooses a new parent. The moving node’s 
former parent propagates a death certificate up the 
hierarchy, while at nearly the same time the new 
parent begins propagating a birth certificate up the 
tree. If the birth certificate arrives at the root first, 
when the death certificate arrives the root will be- 
lieve that the node has failed. This inaccuracy will 
remain indefinitely since a new birth certificate will 
only be sent in response to a change in the hierarchy 
that may not occur for an arbitrary period of time. 


To alleviate this problem, a node maintains a sequence 
number indicating how many times it has 
changed parents. All changes involving a node are 
tagged with that number. A node ignores changes 
that are reported to it about a node if it has already 
seen a change with a higher sequence number. For 
instance, a node may have changed parents 17 times. 
When it changes again, its former parent will propa- 
gate a death certificate annotated with 17. However, 
its new parent will propagate a birth certificate an- 
notated with 18. If the birth certificate arrives first, 
the death certificate will be ignored since it is older. 
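
The sequence-number rule can be sketched as a small table update. The `Certificate` and `Entry` types and the function name are hypothetical; the only rule taken from the text is that a change is ignored when a change with a higher sequence number has already been seen for the same node.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Certificate:
    kind: str                     # "birth" or "death"
    node: str                     # node the certificate is about
    seq: int                      # times that node had changed parents
    parent: Optional[str] = None  # new parent, for birth certificates


@dataclass
class Entry:
    alive: bool
    seq: int
    parent: Optional[str]


def apply_certificate(table: Dict[str, Entry], cert: Certificate) -> bool:
    """Apply cert to table; return False if it is stale and was ignored."""
    known = table.get(cert.node)
    if known is not None and known.seq > cert.seq:
        return False  # a newer change has already been seen
    if cert.kind == "birth":
        table[cert.node] = Entry(True, cert.seq, cert.parent)
    else:
        old_parent = known.parent if known is not None else None
        table[cert.node] = Entry(False, cert.seq, old_parent)
    return True


# The race from the text: the birth (18) arrives before the death (17).
table: Dict[str, Entry] = {}
birth_applied = apply_certificate(table, Certificate("birth", "n", 18, "new-parent"))
death_applied = apply_certificate(table, Certificate("death", "n", 17))
```

With the check in place the late death certificate is dropped and the node stays marked alive under its new parent, regardless of arrival order.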


An important optimization to the up/down protocol 
prevents large sets of birth certificates from arriving 
at the root in response to a node with many de- 
scendants choosing a new parent. Normally, when 
a node moves to a new parent, a birth certificate 
must be sent out for each of its descendants to its 
new parent. This maintains the invariant that a 
node knows the parent of all its descendants. Keep 
in mind that a birth certificate is not only a record 
that a node exists, but that it has a certain parent. 


Although this large set of updates is required, it is 
usually unnecessary for these updates to continue 
far up the hierarchy. For example, when a node 
relocates beneath a sibling, the sibling must learn 
about all of the node’s descendants, but when the 
sibling, in turn, passes these certificates to the orig- 
inal parent, the original parent notices that they do 
not represent a change and quashes the certificates 
from further propagation. 
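
The quashing step can be sketched as a filter applied when a node receives certificates from a child: anything that matches what the node already records is dropped instead of being forwarded. The table layout (node mapped to a parent and sequence number) and the function name are assumptions made for the example, and the stale-certificate check is omitted for brevity.

```python
def receive_certificates(table, certificates):
    """Update table and return only the certificates worth forwarding.

    table:        node -> (parent, seq) as currently known
    certificates: iterable of (node, parent, seq) birth certificates
    """
    forwarded = []
    for node, parent, seq in certificates:
        if table.get(node) == (parent, seq):
            continue  # not a change here: quash it
        table[node] = (parent, seq)
        forwarded.append((node, parent, seq))
    return forwarded


# Node x (with descendants d1, d2) moves from parent p to its sibling s.
# This is p's table before s passes the certificates up.
table_at_p = {"s": ("p", 0), "x": ("p", 3), "d1": ("x", 1), "d2": ("x", 0)}
from_s = [("x", "s", 4), ("d1", "x", 1), ("d2", "x", 0)]
forwarded = receive_certificates(table_at_p, from_s)
```

Only the certificate for `x` is news to the original parent; the certificates for `d1` and `d2` repeat what it already knows and stop there, so the traffic continuing toward the root does not grow with the size of the moved subtree.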


Using the up/down protocol, the root of the hi- 
erarchy will receive timely updates about changes 
to the network. The freshness of the information 
can be tuned by varying the length of time between 
check-ins. Shorter periods between updates guaran- 
tee that information will make its way to the root 
more quickly. Regardless of the update frequency, 
bandwidth requirements at the root will be propor- 
tional to the number of changes in the hierarchy 
rather than the size of the hierarchy itself. 


4.4 Replicating the root 


In Overcast, there appears to be the potential for 
significant scalability and reliability problems at the 
root. The up/down protocol works to alleviate the 
scalability difficulties in maintaining global state 
about the distribution tree, but the root is still 
responsible for handling all join requests from all 
HTTP clients. The root handles such requests by 
redirection, which is far less resource intensive than 
actually delivering the requested content. Nonethe- 
less, the possibility of overload remains for particu- 
larly popular groups. The root is also a single point 
of failure. 


To address this, Overcast uses a standard technique 
used by many popular websites. The DNS name of 
the root resolves to any number of replicated roots 
in round-robin fashion. The database used to per- 
form redirections is replicated to all such roots. In 
addition, IP address takeover may be used for imme- 
diate failover, since DNS caching may cause clients 
to continue to contact a failed replica. This sim- 
ple, standard technique works well for this purpose 
because handling joins from HTTP clients is a read-only 
operation that lends itself well to distribution over 
numerous replicas. 


There remains, however, a single point of failure for 
the up/down protocol. The functionality of the root 
in the up/down protocol cannot be distributed so 
easily because its purpose is to maintain changing 
state. However the up/down protocol has the use- 
ful property that all nodes maintain state for nodes 
below them in the distribution tree. Therefore, a 
convenient technique to address fault tolerance is to 
specially construct the top of the hierarchy. 


Starting with the root, some number of nodes are 
configured linearly, that is, each has only one child. 
In this way all other Overcast nodes lie below these 
top nodes. Figure 2 shows a distribution tree in 
which the top three nodes are arranged linearly. 
Each of these nodes has enough information to act 
as the root of the up/down protocol in case of a fail- 
ure. This technique has the drawback of increasing 
the latency of content distribution unless special- 
case code skips the extra roots during distribution. 
If latency were important to Overcast this would be 
an important, but simple, optimization. 


“Linear roots” work well with the need for replication 
to address scalability, as mentioned above. The 
set of linear nodes has all the information needed to 
perform Overcast joins, therefore these nodes are 
perfect candidates to be used in the DNS round-robin 
approach to scalability. By choosing these 
nodes, no further replication is necessary. 


Figure 2: A specially configured distribution topology that 
allows either of the grey nodes to quickly stand in as the root 
(black) node. All filled nodes have complete status information 
about the unfilled nodes. 


4.5 Joining a multicast group 


To join a multicast group, a Web client issues an 
HTTP GET request with the URL for a group. The 
hostname of the URL names the root node(s). The 
root uses the pathname of the URL, the location of 
the client, and its database of the current status of 
the Overcast nodes to decide where to connect the 
client to the multicast tree. Because status informa- 
tion is constantly propagated to the root, a decision 
may be made quickly without further network traf- 
fic, enabling fast joins. 
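
A join handled purely from the root’s local state might look like the sketch below. Everything about the selection policy is a placeholder: the status table layout, the notion of a client “area”, and the alphabetical tie-break are hypothetical and are not Overcast’s actual server selection, which is outside the scope of this paper. The point illustrated is only that the decision needs no further network traffic.

```python
from urllib.parse import urlsplit


def handle_join(group_url, client_area, status_table):
    """Return the URL to redirect the client to, or None if no node fits.

    status_table: host -> {"up": bool, "areas": set of areas it serves}
    """
    # The path of the URL identifies the group within this root's network.
    path = urlsplit(group_url).path
    # Consider only nodes known to be up that serve the client's area.
    candidates = sorted(
        host
        for host, info in status_table.items()
        if info["up"] and client_area in info["areas"]
    )
    if not candidates:
        return None
    # Placeholder policy: take the first candidate in alphabetical order.
    return "http://" + candidates[0] + path


status_table = {
    "node1.example.com": {"up": False, "areas": {"west"}},
    "node2.example.com": {"up": True, "areas": {"west"}},
    "node3.example.com": {"up": True, "areas": {"east"}},
}
target = handle_join("http://root.example.com/videos/talk", "west", status_table)
```

The node that is down is never offered, which is the role the up/down protocol plays in joins.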


Joining a group consists of selecting the best server 
and redirecting the client to that server. The de- 
tails of the server selection algorithm are beyond 
the scope of this paper as considerable previous 
work [3, 18] exists in this area. Furthermore, Over- 
cast’s particular choices are constrained consider- 
ably by a desire to avoid changes at the client. With- 
out such a constraint simpler choices could have 
been made, such as allowing clients to participate 
directly in the Overcast tree building protocol. 


Although we do not discuss server selection here, a 
number of Overcast’s details exist to support this 
important functionality, however it may actually be 
implemented. A centralized root performing redi- 
rections is convenient for an approach involving 
large tables containing collected Internet topology 
data. The up/down algorithm allows for redirec- 
tions to nodes that are known to be functioning. 


4.6 Multicasting with Overcast 


We refer to reliable multicasting on an Overcast network 
as “overcasting”. Overcasting proceeds along 
the distribution tree built by the tree protocol. 
Data is moved between parent and child using TCP 
streams. If a node has four children, four separate 
connections are used. The content may be pipelined 
through several generations in the tree. A large file 
or a long-running live stream may be in transit over 
tens of different TCP streams at a single moment, 
in several layers of the distribution hierarchy. 


If a failure occurs during an overcast, the distri- 
bution tree will rebuild itself as described above. 
After the tree is rebuilt, an on-demand distribution 
resumes where it left off. In order 
to do so, each node keeps a log of the data it has 
received so far. After recovery, a node inspects the 
log and restarts all overcasts in progress. 


Live content on the Internet today is typically 
buffered before playback. This compensates for mo- 
mentary glitches in network throughput. Overcast 
can take advantage of this buffering to mask the 
failure of a node being used to overcast data. As 
long as the failure occurs in a node that is not at the 
edge of the Overcast network, an HTTP client need 
not ever become aware that the path of data from 
the root has been changed in the face of failure. 


5 Evaluation 


In this section, the protocols presented above are 
evaluated by simulation. Although we have de- 
ployed Overcast in the real world, we have not yet 
deployed on a sufficiently large network to run the 
experiments we have simulated. 


To evaluate the protocols, an overlay network is simulated 
with increasing numbers of Overcast nodes 
while keeping the total number of network nodes 
constant. Overcast should build better trees as 
more nodes are deployed, but protocol overhead 
may grow. 


We use the Georgia Tech Internetwork Topology 
Models [25] (GT-ITM) to generate the network 
topologies used in our simulations. We use the 
“transit-stub” model to obtain graphs that more 
closely resemble the Internet than a pure random 
construction. GT-ITM generates a transit-stub 
graph in stages: first a number of random backbones 
(transit domains), then the random structure 
of each backbone, and finally random “stub” graphs 
attached to each node in the backbones. 


We use this model to construct five different 600 
node graphs. Each graph is made up of three tran- 
sit domains. These domains are guaranteed to be 
connected. Each transit domain consists of an average 
of eight stub networks. The stub networks contain 
edges amongst themselves with a probability of 
0.5. Each stub network consists of an average of 25 
nodes, in which nodes are once again connected with 
a probability of 0.5. These parameters are from the 
sample graphs in the GT-ITM distribution; we are 
unaware of any published work that describes pa- 
rameters that might better model common Internet 
topologies. 


We extended the graphs generated by GT-ITM 
with bandwidth information. Links internal to 
the transit domains were assigned a bandwidth 
of 45 Mbit/s, edges connecting stub networks to 
the transit domains were assigned 1.5 Mbit/s, and 
edges in the local stub domains were assigned 
100 Mbit/s. These reflect commonly used network 
technology: T3s, T1s, and Fast Ethernet. All 
measurements are averages over the five generated 
topologies. 


Empirical measurements from actual Overcast 
nodes show that a single Overcast node can eas- 
ily support twenty clients watching MPEG-1 videos, 
though the exact number is greatly dependent on 
the bandwidth requirements of the content. Thus 
with a network of 600 Overcast nodes, we are simulating 
multicast groups of perhaps 12,000 members. 


5.1 Tree protocol 


The efficiency of Overcast depends on the position- 
ing of Overcast nodes. In our first experiments, we 
compare two different approaches to choosing po- 
sitions. The first approach, labelled “Backbone”, 
preferentially chooses transit nodes to contain Over- 
cast nodes. Once all transit nodes are Overcast 
nodes, additional nodes are chosen at random. This 
approach corresponds to a scenario in which the 
owner of the Overcast nodes places them strategi- 
cally in the network. In the second, labelled “Ran- 
dom”, we select all Overcast nodes at random. This 
approach corresponds to a scenario in which the 
owner of Overcast nodes does not pay attention to 
where the nodes are placed. 


The goal of Overcast’s tree-building protocol is to 
optimize the bottleneck bandwidth available back 
to the root for all nodes. The goal is to provide 
each node with the same bandwidth to the root that 
the node would have in an idle network. Figure 3 
compares the sum of all nodes’ bandwidths back to 
the root in Overcast networks of various sizes to 
the sum of all nodes’ bandwidths back to the root 
in an optimal distribution tree using router-based 
software. This indicates how well Overcast performs 
compared to IP Multicast. 


Figure 3: Fraction of potential bandwidth provided by 
Overcast. 


The main observation is that, as expected, the back- 
bone strategy for placing Overcast nodes is more 
effective than the random strategy, but the results 
of random placement are encouraging nonetheless. 
Even a small number of deployed Overcast nodes, 
positioned at random, provide approximately 70%- 
80% of the total possible bandwidth. 


It is extremely encouraging that, when using the 
backbone approach, no node receives less bandwidth 
under Overcast than it would receive from IP Mul- 
ticast. However some enthusiasm must be withheld, 
because a simulation artifact has been left in these 
numbers to illustrate a point. 


Notice that the backbone approach and the random 
approach differ in effectiveness even when all 600 
nodes of the network are Overcast nodes. In this 
case the same nodes are participating in the proto- 
col, but better trees are built using the backbone 
approach. This illustrates that the trees created by 
the tree-building protocol are not unique. The back- 
bone approach fares better by this metric because 
in our simulations backbone nodes were turned on 
first. This allowed backbone nodes to preferentially 
form the “top” of the tree. This indicates that in 
future work it may be beneficial to extend the tree- 
building protocol to accept hints that mark certain 
nodes as “backbone” nodes. These nodes would 
preferentially form the core of the distribution tree. 


Overcast appears to perform quite well for its in- 
tended goal of optimizing available bandwidth, but 
it is reasonable to wonder what costs are associated 
with this performance. 


Figure 4: Ratio of the number of times a packet must “hit 
the wire” to be propagated through an Overcast network to a 
lower bound estimate of the same measure for IP Multicast. 


To explore this question we measure the network 
load imposed by Overcast. We define network load 
to be the number of times that a particular piece of 
data must traverse a network link to reach all Overcast 
nodes. To compare with IP Multicast, 
Figure 4 plots the ratio of the network load imposed 
by Overcast to a lower bound estimate of IP Mul- 
ticast’s network load. For a given set of nodes, we 
assume that IP Multicast would require exactly one 
less link than the number of nodes. This assumes 
that all nodes are one hop away from another node, 
which is unlikely to be true in sparse topologies, but 
provides a lower bound for comparison. 
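

As a rough illustration of the ratio plotted in Figure 4, the following Python sketch divides a measured Overcast network load by the IP Multicast lower bound of one link fewer than the number of nodes. The function name and the sample numbers are assumptions made for illustration; they are not values taken from the simulations.

```python
def network_load_ratio(overcast_link_traversals: int, num_nodes: int) -> float:
    """Ratio of Overcast's network load to the IP Multicast lower bound.

    The lower bound assumes IP Multicast needs exactly one link fewer than
    the number of nodes, i.e. every node is one hop away from another node.
    """
    if num_nodes < 2:
        raise ValueError("need at least two nodes to form a tree")
    ip_multicast_lower_bound = num_nodes - 1
    return overcast_link_traversals / ip_multicast_lower_bound


# Hypothetical measurement: data crosses 147 links to reach 50 nodes.
print(network_load_ratio(147, 50))  # 3.0
```

Because the denominator is optimistic for sparse node sets, the ratio overstates Overcast's cost when only a few nodes are deployed.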


Figure 4 shows that for Overcast networks with 
more than 200 nodes, Overcast imposes somewhat 
less than twice as much network load as IP Multicast. 
In return for this extra load, Overcast offers 
reliable delivery, immediate deployment, and future 
flexibility. For networks with few Overcast nodes, 
Overcast appears to impose a considerably higher 
network load than IP Multicast. This is a result of 
our optimistic lower bound on IP Multicast’s net- 
work load, which assumes that 50 randomly placed 
nodes in a 600 node network can be spanned by 49 
links. 


Another metric to measure the effectiveness of an 
application-level multicast technique is stress, pro- 
posed in [16]. Stress indicates the number of times 
that the same data traverses a particular physical 
link. By this metric, Overcast performs quite well 
with average stresses of between 1 and 1.2. We do 
not present detailed analysis of Overcast’s perfor- 
mance by this metric, however, because we believe 
that network load is more telling for Overcast. That 
is, Overcast has quite low scores for average stress, 
but that metric does not describe how often a longer 
route was taken when a shorter route was available. 


Figure 5: Number of rounds to reach a stable distribution 
tree as a function of the number of Overcast nodes and the 
length of the lease period. 
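

The difference between stress and network load can be seen in a small Python sketch. It assumes that each overlay connection is described by the ordered list of routers it crosses, and that average stress is taken over the links that carry data; both the representation and the toy topology are assumptions made for illustration.

```python
from collections import Counter


def link_stress(overlay_paths):
    """Count how many times the same data crosses each physical link.

    overlay_paths: one entry per overlay (parent -> child) connection, each a
    list of the hosts and routers that the connection traverses, in order.
    """
    stress = Counter()
    for path in overlay_paths:
        for a, b in zip(path, path[1:]):
            stress[frozenset((a, b))] += 1  # physical links are undirected
    return stress


# Hypothetical topology: the connection a -> b reuses the link between a and r1.
paths = [
    ["root", "r1", "a"],
    ["a", "r1", "r2", "b"],
]
stress = link_stress(paths)
network_load = sum(stress.values())          # total link traversals
average_stress = network_load / len(stress)  # mean over links that carry data
print(network_load, average_stress)  # 5 1.25
```

Here only one link is crossed twice, so average stress stays near 1 even though the data takes five link traversals to reach two nodes.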


Another question is how fast the tree protocol con- 
verges to a stable distribution tree, assuming a sta- 
ble underlying network. This is dependent on three 
parameters. The round period controls how long a 
node that has not yet determined a stable position 
in the hierarchy will wait before evaluating a new set 
of potential parents. The reevaluation period deter- 
mines how long a node will wait before reevaluating 
its position in the hierarchy once it has obtained a 
stable position. Finally, the lease period determines 
how long a parent will wait to hear from a child 
before reporting the child’s death. 


For convenience, we measure all convergence times 
in terms of the fundamental unit, the round time. 
We also set the reevaluation period and lease pe- 
riod to the same value. Figure 5 shows how long 
Overcast requires to converge if an entire Overcast 
network is simultaneously activated. To demonstrate 
the effect of a changing reevaluation and lease 
period, we plot for the “standard” lease time—10 
rounds, as well as longer and shorter periods. Lease 
periods shorter than five rounds are impractical be- 
cause children actually renew their leases a small 
random number of rounds (between one and three) 
before their lease expires to avoid being thought 
dead. We expect that a round period on the order of 
1-2 seconds will be practical for most applications. 


We next measure convergence times for an existing 
Overcast network in which Overcast nodes are added 
or fail. We simulate Overcast networks of various 
sizes until they quiesce, add and remove Overcast 
nodes, and then simulate the network until it quiesces 
once again. We measure the time, in rounds, 
for the network to quiesce after the changes. We 
measure for various numbers of additions and removals, 
allowing us to assess the dependence of convergence 
on how many nodes have changed state. 
We measure only the backbone approach. 


Figure 6: Number of rounds to recover a stable distribution 
tree as a function of the number of nodes that change state 
and the number of nodes in the network. 


Figure 6 plots convergence times (using a 10-round 
lease time) against the number of Overcast nodes in 
the network. The convergence time for node failures 
is quite modest. In all simulations, the Overcast 
network reconverged after fewer than three lease 
times. Furthermore, the reconvergence time scaled 
well against both the number of nodes failing and 
the total number of nodes in the Overcast network. 
In neither case did the convergence time grow even 
linearly. 


For node additions, convergence times do appear 
more closely linked to the size of the Overcast net- 
work. This makes intuitive sense because new nodes 
are navigating the network to determine their best 
location. Even so, in all simulations fewer than 
five lease times are required. It is important to 
note that an Overcast network continues to func- 
tion even while stabilizing. Performance may be 
somewhat impacted by increased measurement traf- 
fic and by TCP setup and tear down overhead as 
parents change, but such disruptions are localized. 


5.2 Up/Down protocol 


The goal of the up/down algorithm is to minimize 
the bandwidth required at the root node while main- 
taining timely status information for the entire net- 
work. Factors that affect the amount of bandwidth 
used include the size of the Overcast network and 
the rate of topology changes. Topology changes occur 
when the properties of the underlying network 
change, nodes fail, or nodes are added. Therefore, 
the up/down algorithm is evaluated by simulating 
Overcast networks of various sizes in which various 
numbers of failures and additions occur. 


Figure 7: Certificates received at the root in response to 
node additions. 


To assess the up/down protocol’s ability to provide 
timely status updates to the root without undue 
overhead we keep track of the number of certificates 
(for both “birth” and “death”) that reach the root 
during the previous convergence tests. This is in- 
dicative of the bandwidth required at the root node 
to support an Overcast network of the given size and 
is dependent on the amount of topology change in- 
duced by the additions and deletions. 


Figure 7 graphs the number of certificates received 
by the root node in response to new nodes being 
brought up in the Overcast network. Remember, the 
root may receive multiple certificates per node addition 
because the addition is likely to cause some 
topology reconfiguration. Each time a node picks 
a new parent, that parent propagates a birth certificate. 
These results indicate that the number 
of certificates is quite modest: certainly no more 
than four certificates per node addition, usually 
approximately three. What is more important is that 
the number of certificates scales more closely with the 
number of new nodes than with the size of the Overcast 
network. This gives evidence that Overcast can scale 
to large networks. 


Similarly, Overcast requires few certificates to react 
to node failures. Figure 8 shows that in the common 
case, no more than four certificates are required per 
node failure. Again, because the number of certifi- 
cates is proportional to the number of failures rather 
than the size of the network, Overcast appears to offer 
the ability to scale to large networks. 


Figure 8: Certificates received at the root in response to 
node deletions. 


On the other hand, Figure 8 shows that there are 
some cases that fall far outside the norm. The large 
spikes at 50 and 150 node networks with 5 and 10 
failures occurred because of failures that happened 
to occur near the root. When a node with a sub- 
stantial number of children chooses a new parent 
it must convey its entire set of descendants to its 
new parent. That parent then propagates the entire 
set. However, when the information reaches a node 
that already knows the relationships in question, the 
update is quashed. In these cases, because the reconfigurations 
occurred high in the tree, there was 
no chance to quash the updates before they reached 
the root. In larger networks such failures are less 
likely. 
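

The quashing behavior described above can be sketched with a toy Python model. It is a simplification under stated assumptions: each node's table maps every descendant to that descendant's parent, only birth certificates are modeled (no death certificates or sequence numbers), and the helper names build_tables and reparent are hypothetical. The function returns how many certificates arrive at the root when a node moves to a new parent.

```python
def build_tables(parent):
    """Give every node a table mapping each of its descendants to that descendant's parent."""
    known = {n: {} for n in set(parent) | set(parent.values())}
    for node, p in parent.items():
        ancestor = p
        while ancestor is not None:
            known[ancestor][node] = p
            ancestor = parent.get(ancestor)
    return known


def reparent(parent, known, mover, new_parent, root="root"):
    """Move `mover` under `new_parent`; return the number of certificates reaching the root."""
    # The mover conveys itself and its entire set of descendants.
    subtree, frontier = [], [mover]
    while frontier:
        node = frontier.pop()
        subtree.append(node)
        frontier.extend(c for c, p in parent.items() if p == node)

    parent[mover] = new_parent
    certs = [(n, parent[n]) for n in subtree]

    hop = new_parent
    while True:
        arrived = len(certs)
        # Quash relationships this node already knows; record and forward the rest.
        certs = [(n, p) for n, p in certs if known[hop].get(n) != p]
        known[hop].update(certs)
        if hop == root:
            return arrived
        if not certs:
            return 0
        hop = parent[hop]


# Reconfiguration deep in the tree: the common ancestor "m" quashes what it knows.
deep = {"m": "root", "a": "m", "b": "m", "x": "a", "y": "x"}
deep_count = reparent(deep, build_tables(deep), "x", "b")

# Reconfiguration just below the root: nothing can be quashed on the way up.
near_root = {"a": "root", "b": "root", "x": "a", "y": "x"}
near_count = reparent(near_root, build_tables(near_root), "x", "b")
print(deep_count, near_count)  # 1 2
```

With larger subtrees the gap widens, since every descendant of the moved node contributes a certificate that must travel all the way to the root when no intermediate ancestor can quash it.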


6 Conclusions 


We have described a simple tree-building protocol 
that yields bandwidth-efficient distribution trees for 
single-source multicast and our up/down protocol 
for providing timely status updates to the root of the 
distribution tree in a scalable manner. Overcast implements 
these protocols in an overlay network over 
the existing Internet. The protocols allow Overcast 
networks to dynamically adapt to changes (such as 
congestion and failures) in the underlying network 
infrastructure and support large, reliable single- 
source multicast groups. Geographically-dispersed 
businesses have deployed Overcast nodes in small- 
scale Overcast networks for distribution of high- 
quality, on-demand video to unmodified desktops. 


Simulation studies with topologies created with the 
Georgia Tech Internetwork Topology Models show 
that Overcast networks work well on large-scale networks, 
supporting multicast groups of up to 12,000 
members. Given these results and the low cost for 
Overcast nodes, we believe that putting computa- 
tion and storage in the network fabric is a promis- 
ing approach for adding new services to the Internet 
incrementally. 
Abstract 


This paper describes the implementation and evaluation 
of an operating system module, the Congestion Manager 
(CM), which provides integrated network flow manage- 
ment and exports a convenient programming interface 
that allows applications to be notified of, and adapt to, 
changing network conditions. We describe the API by 
which applications interface with the CM, and the archi- 
tectural considerations that factored into the design. To 
evaluate the architecture and API, we describe our im- 
plementations of TCP; a streaming layered audio/video 
application; and an interactive audio application using 
the CM, and show that they achieve adaptive behavior 
without incurring much end-system overhead. All flows 
including TCP benefit from the sharing of congestion 
information, and applications are able to incorporate 
new functionality such as congestion control and adap- 
tive behavior. 


1 Introduction 


The impressive scalability of the Internet infrastructure 
is in large part due to a design philosophy that advo- 
cates a simple architecture for the core of the network, 
with most of the intelligence and state management im- 
plemented in the end systems [10]. The service model 
provided by the network substrate is therefore primar- 
ily a “best-effort” one, which implies that packets may 
be lost, reordered or duplicated, and end-to-end delays 
may be variable. Congestion and accompanying packet 
loss are common in heterogeneous networks like the In- 
ternet because of overload, when demand for router re- 
sources, such as bandwidth and buffer space, exceeds 
what is available. Thus, end systems in the Internet 
should incorporate mechanisms for detecting and react- 
ing to network congestion, probing for spare capacity 
when the network is uncongested, as well as managing 
their available bandwidth effectively. 

Previous work has demonstrated that the result of 
uncontrolled congestion is a phenomenon commonly 
called “congestion collapse” [8, 13]. Congestion collapse 
is largely alleviated today because the popular end-to- 
end Transmission Control Protocol (TCP) [30, 40] in- 
corporates sound congestion avoidance and control al- 
gorithms. However, while TCP does implement con- 
gestion control [18], many applications including the 
Web [6, 12] use several logically different streams in par- 
allel, resulting in multiple concurrent TCP connections 
between the same pair of hosts. As several researchers 
have shown [2, 3, 27, 28, 42], these concurrent connec- 
tions compete with — rather than learn from — each 
other about network conditions to the same receiver, 
and end up being unfair to other applications that use 
fewer connections. The ability to share congestion in- 
formation between concurrent flows is therefore a useful 
feature, one that promotes cooperation among different 
flows rather than adverse competition. 

Another trend in today’s Internet is the increasing number 
of applications that do not use TCP as their underlying 
transport, because of the constraining reliability and 
ordering semantics imposed by its in-order byte-stream 
abstraction. Streaming audio and video [25, 34, 41] and 
customized image transport protocols are significant ex- 
amples. Such applications use custom protocols that 
run over the User Datagram Protocol (UDP) [29], often 
without implementing any form of congestion control. 
The unchecked proliferation of such applications would 
have a significant adverse effect on the stability of the 
network [3, 8, 13]. 

Many Internet applications deliver documents and 
images or stream audio and video to end users and 
are interactive in nature. A simple but useful figure-of- 
merit for interactive content delivery is the end-to-end 
download latency; users typically wait no more than a 
few seconds before aborting a transfer if they do not 
observe progress. Therefore, it would be beneficial for 
content providers to adapt what they disseminate to the 
state of the network, so as not to exceed a threshold 
latency. Fortunately, such content adaptation is possi- 
ble for most applications. Streaming audio and video 
applications typically encode information in a range of 
formats corresponding to different encoding (transmis- 
sion) rates and degrees of loss resiliency. Image encod- 
ing formats accommodate a range of qualities to suit a 
variety of client requirements. 

Today, the implementor of an Internet content dis- 
semination application has a challenging task: for her 
application to be safe for widespread Internet deploy- 
ment, she must either use TCP and suffer the conse- 
quences of its fully-reliable, byte-stream abstraction, or 
use an application-specific protocol over UDP. With 
the latter option, she must re-implement congestion 
control mechanisms, thereby risking errors not just in 
the implementation of her protocol, but also in the 
implementation of the congestion controller. Further- 
more, neither alternative allows for sharing conges- 
tion information across flows. Finally, the common 
application programming interface (API) classes for 
network applications—Berkeley sockets, streams, and 
Winsock [31]—do not expose any information about the 
state of the network to applications in a standard way¹. 
This makes it difficult for applications running on ex- 
isting end host operating systems to make an informed 
decision, taking network variables into account, during 
content adaptation. 


1.1 The Congestion Manager 


Our previous work provided the rationale, initial design, 
and simulation of the Congestion Manager, an end- 
system architecture for sharing congestion information 
between multiple concurrent flows [3]. In this paper, we 
describe the implementation and evaluation of the CM 
in the Linux operating system. We focus on a version 
of the CM where the only changes made to the current 
IP stack are at the data sender, with feedback about 
congestion or successful data receptions being provided 
by the receiver CM applications to their sending peers, 
which communicate this information to the CM via an 
API. We present a summary of the API used by applications 
to adapt their transmissions to changing network 
conditions, and focus on those elements of the API that 
changed in the transition from the simulation to the im- 
plementation. 

We evaluate the Congestion Manager by posing and 
answering several key questions: 

Is its callback interface, used to inform ap- 
plications of network state and other events, ef- 
fective for a diverse set of applications to adapt 
without placing a significant burden on develop- 
ers? 

Because most robust congestion control algorithms 
rely on receiver feedback, it is natural to expect that 
a CM receiver is needed to inform the CM sender of 
successful transmissions and packet losses. However, 
to facilitate deployment, we have designed our system 
to take advantage of the fact that several protocols in- 
cluding TCP and other applications already incorporate 
some form of application-specific feedback, providing 
the CM with the loss and timing information it needs 
to function effectively. 


¹ Utilities like netstat and ifconfig provide some information 
about devices, but not end-to-end performance 
information that can be used for adapting content. 

Using the CM API, we implement several case stud- 
ies both in and out of the kernel, showing the applica- 
bility of the API to many different application archi- 
tectures. Our implementation of a layered streaming 
audio/video application demonstrates that the CM ar- 
chitecture can be used to implement highly adaptive 
congestion controlled applications. Adaptation via the 
CM helps these applications achieve better performance 
and also be fair to other flows on the Internet. 

We have also modified a legacy application—the In- 
ternet audio tool vat from the MASH toolkit [23]— 
to use the CM to perform adaptive real-time delivery. 
Since less than one hundred lines of source code mod- 
ification was required to CM-enable this complex ap- 
plication and make it adapt to network conditions, we 
believe it demonstrates the ease with which the CM 
makes applications adaptive. 

Is the congestion control correct? 

As a trusted kernel module, the CM frees both transport 
protocols and applications from the burden of implementing 
congestion management. We show that the 
CM behaves in the same network-friendly manner as 
TCP for single flows. Furthermore, by integrating flow 
information between both kernel protocols and user ap- 
plications, we ensure that an ensemble of concurrent 
flows is not an overly aggressive user of the network. 

In today’s off-the-shelf operating systems, 
does the CM place any performance limitations 
upon applications? 

We find that our implementation of TCP (which 
uses the CM for its congestion control) has essentially 
the same performance as standard TCP, with the added 
benefits of integrated congestion management across 
flows, with only small (0-3%) CPU overhead. 

In a CM system where no changes are made to the 
receiver protocol stack, UDP-based applications must 
implement a congestion feedback mechanism, resulting 
in more overhead compared to the TCP applications. 
However, we show that these applications remain vi- 
able, and that the architectural change and API calls 
reduce worst-case throughput by 0-25%, even for applications 
that desire fine-grained information about the 
network on a per-packet basis. 

To our knowledge, this is the first implementation 
of a general application-independent system that com- 
bines integrated flow management with a convenient 
API to enable content adaptation. The end-result is 
that applications achieve the desirable congestion con- 
trol properties of long-running TCP connections, to- 
gether with the flexibility to adapt data transmissions 
to prevailing network conditions. 

The rest of this paper is organized as follows. Sec- 
tion 2 describes our system architecture and implemen- 
tation. Section 3 describes how network-adaptive appli- 
cations can be engineered using the CM, while Section 4 
presents the results of several experiments. In Section 5, 
we discuss some miscellaneous details and open issues 
in the CM architecture. We survey related work in Sec- 
tion 6 and conclude with a summary in Section 7. 


2 System Architecture and Implemen- 
tation 


The CM performs two important functions. First, it en- 
ables efficient multiplexing and congestion control by in- 
tegrating congestion management across multiple flows. 
Second, it enables efficient application adaptation to 
congestion by exposing its knowledge of network con- 
ditions to applications. Most of the CM functionality 
in our Linux implementation is in-kernel; this choice 
makes it convenient to integrate congestion manage- 
ment across both TCP flows and other user-level pro- 
tocols, since TCP is implemented in the kernel. 

To perform efficient aggregation of congestion infor- 
mation across concurrent flows, the CM has to identify 
which flows potentially share a common bottleneck link 
en route to various receivers. In general, this is a diffi- 
cult problem, since it requires an understanding of the 
paths taken by different flows. However, in today’s In- 
ternet, all flows destined to the same end host take the 
same path in the common case, and we use this group 
of flows as the default granularity of flow aggregation². 
We call this group a macroflow: a group of flows that 
share the same congestion state, control algorithms, and 
state information in the CM. Each flow has a sending 
application that is responsible for its transmissions; we 
call this a CM client. CM clients are in-kernel protocols 
like TCP or user-space applications. 


The CM incorporates a congestion controller that 
performs congestion avoidance and control on a per-macroflow 
basis. It uses a window-based algorithm that 
mimics TCP’s additive-increase/multiplicative-decrease 
(AIMD) scheme to ensure fairness to other TCP flows 
on the Internet. However, the modularity provided by 
the CM encourages experimentation with other non- 
AIMD schemes that may be better suited to specific 
data types such as audio or video. 
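
A minimal Python sketch of an AIMD window is shown here for illustration. The class name, the increase of roughly one MTU per window of acknowledged data, and the halving on loss follow common TCP conventions and are assumptions; slow start and timeouts are omitted, and this is not the CM's actual controller code.

```python
class AimdWindow:
    """Per-macroflow congestion window in bytes (simplified: no slow start)."""

    def __init__(self, mtu: int = 1500):
        self.mtu = mtu
        self.cwnd = float(mtu)

    def on_success(self, bytes_acked: int) -> None:
        # Additive increase: about one MTU for each full window acknowledged.
        self.cwnd += self.mtu * bytes_acked / self.cwnd

    def on_loss(self) -> None:
        # Multiplicative decrease: halve the window, but keep at least one MTU.
        self.cwnd = max(self.mtu, self.cwnd / 2)


w = AimdWindow(mtu=1000)
w.cwnd = 8000.0
w.on_success(8000)   # one full window acknowledged
print(w.cwnd)        # 9000.0
w.on_loss()
print(w.cwnd)        # 4500.0
```

Keeping the controller behind two small entry points like these is what would let a non-AIMD scheme be substituted without changing the clients.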

While the congestion controller determines what the 
current window (rate) ought to be for each macroflow, 
a scheduler decides how this is apportioned among the 
constituent flows. Currently, our implementation uses 
a standard unweighted round-robin scheduler. 

In-kernel CM clients such as a TCP sender use CM 
function calls to transmit data and learn about network 
conditions and events. In contrast, user-space 
clients interact with the CM using a portable, platform-independent 
API described in Section 2.1. A platform-dependent 
CM library, libcm, is responsible for interfacing 
between the kernel and these clients, and is described 
in Section 2.2. These components are shown in 
Figure 1. 


² This is not strictly true in the presence of network-layer 
differentiated services. We address this issue later in this 
section and in Section 5. 


Figure 1. Architecture of the congestion manager at 
the data sender, showing the CM library and the CM. 
The dotted arrows show callbacks, and solid lines show 
the datapath. UDP-CC is a congestion-controlled UDP 
socket implemented using the CM. 

When a client opens a CM-enabled socket, the CM 
allocates a flow to it and assigns the flow to the appro- 
priate macroflow based on its destination. The client 
initiates data transmission by requesting permission to 
send data. At some point in the future depending on the 
available rate, the CM issues a callback permitting the 
client to send data. The client then transmits data, and 
tells the CM it has done so. When the client receives 
feedback from the receiver about its past transmissions, 
it notifies the CM about these and continues. 

When a client makes a request to send on a flow, the 
scheduler checks whether the corresponding macroflow’s 
window is open. If so, the request is granted and the 
client notified, upon which it may send some data. 
Whenever any data is transmitted, the sender’s IP layer 
notifies the CM, allowing it to “charge” the transmis- 
sion to the appropriate macroflow. When the client re- 
ceives feedback from its remote counterpart, it informs 
the CM of the loss rate, number of bytes transmitted 
correctly, and the observed round trip time. On a suc- 
cessful transmission, the CM opens up the window ac- 
cording to its congestion management algorithm and 
grants the next, if any, pending request on a flow as- 
sociated with this macroflow. The scheduler also has 
a timer-driven component to perform background tasks 
and error handling. 
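
The request, grant, charge, and feedback cycle described above can be modeled in a few lines of Python. This is a toy model under stated assumptions: the class and method names are hypothetical, every grant is for one MTU, a grant is held as "reserved" until the IP layer charges it, and the timer-driven component is omitted.

```python
from collections import deque


class Macroflow:
    """Toy model of one macroflow: a shared window plus a round-robin scheduler."""

    def __init__(self, window_bytes, mtu, on_grant):
        self.window = window_bytes  # current window, set by the congestion controller
        self.mtu = mtu
        self.on_grant = on_grant    # stands in for the per-flow send callback
        self.outstanding = 0        # bytes charged but not yet acknowledged
        self.reserved = 0           # bytes granted but not yet charged
        self.counts = {}            # flow id -> number of pending requests
        self.order = deque()        # flows with pending requests, in service order

    def request(self, flow_id):
        """A client asks permission to send up to one MTU on flow_id."""
        if flow_id not in self.counts:
            self.counts[flow_id] = 0
            self.order.append(flow_id)
        self.counts[flow_id] += 1
        self._schedule()

    def notify(self, nsent):
        """The IP layer charges a transmission (possibly 0 bytes) to the macroflow."""
        self.reserved = max(0, self.reserved - self.mtu)
        self.outstanding += nsent
        self._schedule()

    def update(self, nrecd):
        """The client reports feedback; acknowledged bytes reopen the window."""
        self.outstanding = max(0, self.outstanding - nrecd)
        self._schedule()

    def _schedule(self):
        # Grant one MTU at a time while the window has room.
        while self.order and self.outstanding + self.reserved + self.mtu <= self.window:
            flow_id = self.order.popleft()
            self.counts[flow_id] -= 1
            if self.counts[flow_id] > 0:
                self.order.append(flow_id)  # back of the line: round-robin
            else:
                del self.counts[flow_id]
            self.reserved += self.mtu
            self.on_grant(flow_id)


grants = []
m = Macroflow(window_bytes=1500, mtu=1500, on_grant=grants.append)
m.request("A")                    # window open: granted at once
for flow in ("A", "A", "B"):      # window full: these requests queue up
    m.request(flow)
m.notify(1500); m.update(1500)    # first packet sent, then acknowledged
m.notify(1500); m.update(1500)
m.notify(0)                       # client declined to send; the grant is released
print(grants)  # ['A', 'A', 'B', 'A']
```

Flow B is served before flow A's second queued request, which is the behavior expected of an unweighted round-robin scheduler.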
2.1 CM API 


The CM API is specified as a set of functions and callbacks 
that a client uses to interface with the CM. It 
specifies functions for managing state, for performing 
data transmissions, for applications to inform the CM 
of losses, for querying the CM about network state, and 
for constructing and splitting macroflows if the default 
per-destination aggregation is unsuitable for an appli- 
cation. The CM API is discussed in detail in [3], which 
presents the design rationale for the Congestion Man- 
ager. Here we provide an overview of the API and a 
discussion of those features which changed during the 
transition from simulation to implementation. 


2.1.1 State management 


All CM applications call cm_open() before using the 
CM, passing the source and destination addresses and 
transport-layer port numbers, in the form of a struct 
sockaddr. The original CM API required only a destination 
address, but the source address specification 
was necessary to handle multihomed hosts. cm_open() returns 
a flow identifier (cm_flowid), which is used as a 
handle for all future CM calls. Applications may call 
cm_mtu(cm_flowid) to obtain the maximum transmission 
unit to a destination. When a flow terminates, the 
application should call cm_close(cm_flowid). 
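

The flow lifecycle and the default per-destination aggregation can be pictured with a small Python model. The FlowTable class and its methods are hypothetical stand-ins, not the CM's data structures; addresses are plain (host, port) tuples rather than a struct sockaddr.

```python
import itertools


class FlowTable:
    """Toy model: allocate flow ids and group flows into per-destination macroflows."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self.flow_to_macroflow = {}  # flow id -> destination host
        self.macroflows = {}         # destination host -> set of flow ids

    def open(self, src, dst):
        """src and dst are (host, port) pairs; returns a new flow id."""
        flow_id = next(self._next_id)
        dst_host = dst[0]
        self.macroflows.setdefault(dst_host, set()).add(flow_id)
        self.flow_to_macroflow[flow_id] = dst_host
        return flow_id

    def close(self, flow_id):
        """Release the flow; drop its macroflow once the last flow is gone."""
        dst_host = self.flow_to_macroflow.pop(flow_id)
        self.macroflows[dst_host].discard(flow_id)
        if not self.macroflows[dst_host]:
            del self.macroflows[dst_host]


table = FlowTable()
web = table.open(("10.0.0.1", 5000), ("10.0.0.9", 80))
tls = table.open(("10.0.0.1", 5001), ("10.0.0.9", 443))
other = table.open(("10.0.0.1", 5002), ("10.0.0.7", 80))
print(table.macroflows["10.0.0.9"] == {web, tls})  # True
table.close(other)
print("10.0.0.7" in table.macroflows)  # False
```

The two flows to 10.0.0.9 land in the same macroflow and would therefore share congestion state, while the flow to 10.0.0.7 is tracked separately.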


2.1.2 Data transmission 


There are three ways in which an application can use 
the CM to transmit data. These allow a variety of adap- 
tation strategies, depending on the nature of the client 
application and its software structure. 


(i) Buffered send. This API uses a conventional 
write() or sendto() call, but the resulting data 
transmission is paced by the Congestion Manager. 
We use this to implement a generic congestion- 
controlled UDP socket (without content adapta- 
tion), useful for bulk transmissions that do not re- 
quire TCP-style reliability or fine-grained control 
over what data gets sent at a given point in time. 


(ii) Request/callback. This is the preferred mode 
of communication for adaptive senders that are 
based on the ALF (Application-Level Framing [11]) 
principle. Here, the client does 
not send data via the CM; rather, it calls 
cm_request(cm_flowid) and expects a notification 
via the cmapp_send(cm_flowid) callback 
when this request is granted by the CM, at which 
time the client transmits its data. This approach 
puts the sender in firm control of deciding what to 
transmit at a given time, and allows the sender to 
adapt to sudden changes in network performance, 
which is hard to do in a conventional buffered 
transmission API. The client callback is a grant 
for the flow to send up to MTU bytes of data. 
Each call to cm_request() is an implicit request 
for sending up to MTU bytes, which simplifies the 
internal implementation of the CM. This API is 
ideally suited for an implementation of TCP, since 
it needs to make a decision at each stage about 
whether to retransmit a segment or send a new 
one. In the implementation, the cmapp_send callback 
now provides the client with the ID of the 
flow that may transmit. To allow for client programming 
flexibility, the client may now specify 
its callback function via cm_register_send(). 


(iii) Rate callback. A self-timed application trans- 
mitting on a fixed schedule may receive callbacks 
from the CM notifying it when the parameters of 
its communication channel have changed, so that 
it can change the frequency of its timer loop or 
its packet size. The CM informs the client of the 
rate, round-trip time, and packet loss rate for a 
flow via the cmapp_update() callback. During implementation, 
we added a registration function, 
cm_register_update() to select the rate callback 
function, and the cm_thresh(down,up) function: 
If the rate reduces by a factor of down or increases 
by a factor of up, the CM calls cmapp_update(). 
This transmission API is ideally suited for stream- 
ing layered audio and video applications. 
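

The threshold rule for the rate callback can be sketched in Python as follows. The RateNotifier class is hypothetical, and comparing each new rate against the last rate reported to the client (rather than, say, the previous sample) is an assumption about how the factors down and up are applied.

```python
class RateNotifier:
    """Invoke the update callback only when the rate moves past a threshold."""

    def __init__(self, down, up, initial_rate, on_update):
        self.down = down              # notify if the rate falls by this factor
        self.up = up                  # notify if the rate rises by this factor
        self.reported = initial_rate  # last rate the client was told about
        self.on_update = on_update    # stands in for the client's rate callback

    def set_rate(self, rate, rtt, loss):
        fell = rate <= self.reported / self.down
        rose = rate >= self.reported * self.up
        if fell or rose:
            self.reported = rate
            self.on_update(rate, rtt, loss)


calls = []
notifier = RateNotifier(down=2, up=2, initial_rate=100.0,
                        on_update=lambda rate, rtt, loss: calls.append(rate))
for rate in (120.0, 200.0, 150.0, 100.0):
    notifier.set_rate(rate, rtt=0.05, loss=0.0)
print(calls)  # [200.0, 100.0]
```

Small fluctuations (120 and 150 in this run) never reach the application, so a layered streaming client is asked to change layers only when the available rate has doubled or halved.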


2.1.3 Application notifications 


One of the goals of our work was to investigate a 
CM implementation that requires no changes at the 
receiver. Performing congestion management requires 
feedback about transmissions: TCP provides this feed- 
back automatically; some UDP applications may need 
to be modified to do so, but without any system- 
wide changes. Senders must then inform the CM 
about the number of sent and received packets, type 
of congestion loss if any, and a round-trip time sample 
using the cm_update(cm_flowid, nsent, nrecd, 
lossmode, rtt) function. The CM distinguishes between 
“persistent” congestion, as would occur on a TCP 
timeout, versus “transient” congestion when only one 
packet in a window is lost. It also allows congestion 
to be notified using Explicit Congestion Notification 
(ECN) [32], which uses packet markings rather than 
drops to infer congestion. 

To perform accurate bookkeeping of the congestion 
window and outstanding bytes for a macroflow, the 
CM needs to know of each successful transmission from 
the host. Rather than encumber clients with reporting 
this information, we modify the IP output routine to 
call cm_notify(cm_flowid, nsent) on each transmission. 
(The IP layer obtains the cm_flowid using a well-defined 
CM interface that takes the flow parameters 
(addresses, ports, protocol field) as arguments.) However, 
if a client decides not to transmit any data upon a 
cmapp_send() callback invocation, it is expected to call 
cm_notify(dst, 0) to allow the CM to permit some 
other flows on the macroflow to transmit data. 


2.1.4 Querying 


If a client wishes to learn about its (per-flow) avail- 
able bandwidth and round-trip time, it can use the 
cm_query() call, which returns these quantities. This 
is especially useful at the beginning of a stream when 
clients can make an informed decision about the data 
encoding to transmit (e.g., a large color or smaller grey- 
scale image). 


2.2 libcm: The CM library 


The CM library provides users with the convenience of 
a callback-based API while separating them from the 
details of how the kernel-to-user callbacks are implemented. 
While direct function callbacks are convenient 
and efficient in the same address space, as is the case 
when the kernel TCP is a client of the CM, callbacks 
from the kernel to user code in conventional operating 
systems are more difficult. A key decision in the implementation 
of libcm was choosing a kernel/user interface 
that maximizes portability, and minimizes both perfor- 
mance overhead and the difficulty of integration with 
existing applications. The resulting internal interface 
between libcm and the kernel is: 


1. select() on a single per-application CM control 
socket. The write bit indicates that a flow may 
send data, and the exception bit indicates that 
network conditions have changed. 


2. Perform an ioctl to extract a list of all flow IDs 
which may send, or to receive the current network 
conditions for a flow. 
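As a rough illustration of step 1, the following C sketch waits on a control descriptor with select() and maps the write and exception bits to the two event types. The function name cm_wait_events and the CM_EV_* constants are hypothetical and not part of libcm; the ioctl of step 2 is omitted because its request codes are not given here.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Event bits reported to the caller (illustrative names, not from libcm). */
#define CM_EV_CAN_SEND       0x1  /* write bit: some flow may send data */
#define CM_EV_STATUS_CHANGED 0x2  /* exception bit: network conditions changed */

/*
 * Wait up to timeout_ms for activity on the control descriptor.
 * Returns a mask of CM_EV_* bits, 0 on timeout, or -1 on error.
 * A real client would follow a nonzero result with the ioctl of step 2.
 */
int cm_wait_events(int ctl_fd, int timeout_ms)
{
    fd_set wfds, efds;
    struct timeval tv;
    int events = 0;

    FD_ZERO(&wfds);
    FD_ZERO(&efds);
    FD_SET(ctl_fd, &wfds);
    FD_SET(ctl_fd, &efds);

    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* The read set is unused: only the write and exception bits carry meaning. */
    if (select(ctl_fd + 1, NULL, &wfds, &efds, &tv) < 0)
        return -1;

    if (FD_ISSET(ctl_fd, &wfds))
        events |= CM_EV_CAN_SEND;
    if (FD_ISSET(ctl_fd, &efds))
        events |= CM_EV_STATUS_CHANGED;
    return events;
}
```

An application with its own select loop would instead add the control descriptor to its existing write and exception sets and perform the same bit tests after its own select() call returns.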


Note that client programs of the CM do not see 
this interface; they see only the standard cm_* func- 
tions provided by libcm. The use of sockets or signals 
does change the way the application’s event handling 
loop interacts with libcm; after passing the socket into 
libcm, the library performs the appropriate ioctls and
then calls back into the application. 


2.2.1 Implementation alternatives 


We considered a number of mechanisms with which to 
implement libcm. In this section, we discuss our rea- 
sons for choosing the control-socket+select+ioctl ap- 
proach. 

While much research has focused on reducing the 
cost of crossing the user/kernel boundary (extensible 
kernels in SPIN [7], fast, generic IPC in Mach [5], etc.),
many conventional operating systems remain limited
to more primitive methods for kernel-to-user notifica-
tion, each with its own advantages and disadvan-
tages. While functionality like the Mach port set-based
IPC would be ideal for our purposes, pragmatically we
considered four common mechanisms for kernel to user 
communication: Signals, system calls, semaphores, and 
sockets. A discussion of the merits of each follows. 

Signals have several immediate drawbacks. First, 
if the CM were to appropriate an existing signal for 
its own use, it might conflict with an application us- 
ing the same signal. Avoiding this conflict would re- 
quire the standardization of a new signal type, a pro- 
cess both slow and of questionable value, given the ex- 
istence of better alternatives. Second, the cost to an 
application to receive a signal is relatively high, and 
some legacy applications may not be signal-safe. While 
the new POSIX 1003.1b [17] soft realtime signals allow 
delivering a 32-bit quantity with a signal, applications 
would need to follow up a signal with a system call to 
obtain all of the information the kernel wished to de- 
liver, since multiple flows may become ready at once. 
For these reasons, we consider mandating the use of 
signals the wrong course for implementing the kernel 
to user callbacks. However, we provide an option for 
processes to receive a SIGIO when their control socket 
status changes, akin to POSIX asynchronous I/O. 

System calls that block do not integrate well with 
applications that already have their own event loop, 
since without polling, applications cannot wait on the 
results of multiple system calls. A system call is able 
to return immediately with the data the user needs, 
but the impediments it poses to application integration 
are large. System calls would work well in a threaded 
environment, but this presupposes threading support, 
and the select-based mechanism we describe below can 
be used in a threaded system without major additional 
overhead. 

Semaphores suffer from the immediate drawback 
that they are not commonly used in network applica- 
tions. For an application that uses semop on an ar- 
ray of semaphores as its event loop, a CM semaphore 
might be the best implementation avenue, for many of 
the same reasons that we chose sockets for network- 
adaptive applications. However, most network appli- 
cations use socket sets instead of semaphore sets, and 
sockets have a few other benefits, which we discuss next. 

Sockets provide a well-defined and flexible inter- 
face for applications in the form of the select() sys- 
tem call, though they have a downside similar to that of 
signals: an application wishing to receive a notification 
via a socket in a non-blocking manner must select()
on the socket, and then perform a system call to obtain 
data from the socket. However, a select-based inter- 
face meshes well with many network applications that 
already have a select-loop based architecture. Utiliz- 
ing a control socket also helps restrict the code changes 
caused by the CM to the networking stack. 

Finally, we decided to use a single control socket 
instead of one control socket per flow to avoid unnec- 
essary overhead in applications with large numbers of 
open socket descriptors, such as select()-based web-
servers and caches. Because some aspects of select scale
linearly with the number of descriptors, and many op- 
erating systems have limits on the number of open de- 
scriptors, we deemed doubling the socket load for high- 
performance network applications a bad idea. 


2.2.2 Extracting data from the socket 


Select provides notification that “some event” has oc-
curred. In theory, 7 different events could be sent by
abusing the read, write, and exception bits, but ap- 
plications need to extract more information than this. 
The CM provides two types of callbacks. Generally 
speaking, the first is a “permission to send” callback 
for a particular flow. To maintain even distribution 
of bandwidth between flows, a loose ordering should 
be preserved with these messages, but exact ordering 
is unimportant provided no flows are ignored until the 
application receives further updates (thereby starving 
the flows). If multiple permission notifications occur, 
the application should receive all of them so it can 
send data on all available flows. The second callback 
is a “status changed” notification. If multiple status 
changes occur before the application obtains this data 
from the kernel, then only the current status matters. 

The weak ordering and lack of history prompted 
us to choose an ioctl-based query instead of a read 
or message queue interface, minimizing the state that 
must be maintained in the kernel. Status updates sim- 
ply return the current CM-maintained network state 
estimate, and “who can send” queries perform a select- 
like operation on the flows maintained by the kernel, 
requiring no extra state, instead of a potentially expen- 
sive per-process message queue or data stream. Return- 
ing all available flows has an added benefit of reducing 
the number of system calls that must be made if several 
flows become ready simultaneously. 
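The state-minimizing design described above can be sketched in C as follows. This is a simplified model under stated assumptions, not the kernel code: the struct layout, MAX_FLOWS, and the function names cm_set_status and cm_who_can_send are hypothetical. It shows the two properties the text relies on: a status update overwrites the previous estimate rather than queueing, and a single query returns every ready flow.

```c
#include <stddef.h>

#define MAX_FLOWS 8  /* illustrative limit, not from the paper */

/* Per-application CM state in this model (hypothetical layout). */
struct cm_app_state {
    unsigned char ready[MAX_FLOWS]; /* 1 if the flow currently may send */
    double rate_bps;                /* current estimate only, no history */
};

/* A newer status estimate simply replaces the old one; nothing is queued. */
void cm_set_status(struct cm_app_state *s, double rate_bps)
{
    s->rate_bps = rate_bps;
}

/*
 * "Who can send" query: a select-like scan over the flows that copies out
 * every ready flow ID in one call. Returns the number of IDs written.
 */
size_t cm_who_can_send(const struct cm_app_state *s, int *out, size_t max_out)
{
    size_t id, n = 0;

    for (id = 0; id < MAX_FLOWS && n < max_out; id++) {
        if (s->ready[id])
            out[n++] = (int)id;
    }
    return n;
}
```

Because the scan reads existing per-flow state, two flows that become ready between queries cost one call rather than two, and no per-process message queue is needed.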


3 Engineering Network-adaptive Applications


In this section, we describe several different classes of 
applications, and describe the ways those applications 
can make use of the CM. We explore two in-kernel 
clients, and several user-space data server programs, 
and examine the task of integrating each with the CM. 


3.1 Software Architecture Issues 


Typical network applications fall into one of several cat- 
egories: 


• Data-driven: Applications that transmit prespec-
ified data, such as a single file, then exit.


• Synchronous event-driven: Self-timed data deliv-
ery servers, like streaming audio servers.


• Asynchronous event-driven: File servers (http,
ftp) and other network-clocked applications.


The CM library provides several options for adap- 
tive applications that wish to make use of its services: 


1. Data-driven applications may use the buffered API 
to efficiently pace their data transmissions. 


2. An application may operate in an entirely
callback-based manner by allowing libcm to pro- 
vide its own event loop, calling into the applica- 
tion when flows are ready. This is most useful for 
applications coded with the CM in mind. 


3. Signal-driven applications may request a SIGIO 
notification from the CM when an event occurs. 


4. Applications with select-based event loops can 
simply add the CM control socket into their select 
set, and call the libcm dispatcher when the socket
is ready. Rate-clocked applications (or polling- 
based applications) can perform a similar non- 
blocking select test on the descriptor when they 
awaken to send data, or, if they sleep, can re- 
place the sleep with a timed blocking select call. 


5. Applications may poll the CM on their own sched- 
ule. 


The remainder of this section describes how par- 
ticular clients use different CM APIs, from the low- 
bandwidth vat audio application, to the performance- 
critical kernel TCP implementation. Note that all 
UDP-based clients must implement application level 
data acknowledgements in order to make use of the CM. 


3.2 TCP 


We implemented TCP as an in-kernel CM client. 
TCP/CM offloads all congestion control to the CM, 
while retaining all other TCP functionality (connection 
establishment and termination, loss recovery and pro- 
tocol state handling). TCP uses the request/callback 
API as low-overhead direct function calls in the same 
protection domain. This gives TCP the tight control 
it needs over packet scheduling. For example, while 
the arrival of a new acknowledgement typically causes 
TCP to transmit new data, the arrival of three dupli- 
cate ACKs causes TCP to retransmit an old packet. 

Connection creation. When TCP creates a new
connection via either accept (inbound) or connect
(outbound), it calls cm_open() to associate the TCP
connection with a CM flow. Thereafter, the pac-
ing of outgoing data on this connection is controlled
by the CM. When application data becomes avail-
able, after performing all the non-congestion-related
checks (e.g., the Nagle algorithm [40], etc.) data is
queued and cm_request() is called for the flow. When
the CM scheduler schedules the flow for transmission,
the cmapp_send() routine for TCP is called. The
cmapp_send() for TCP transmits any retransmission
from the retransmission queue. Otherwise, it transmits
the data present in the transmit socket buffer by send-
ing up to one maximum segment size of data per call.
Finally, the IP output routine calls cm_notify() when
the data is actually sent out. 

TCP input. The TCP input routines now feed
back to the CM. Round trip time (RTT) sample col-
lection is done as usual using either RFC 1323 times-
tamps [19] or Karn’s algorithm [21] and is passed to CM
via cm_update(). The smoothed estimates of the RTT 
(srtt) and round-trip time deviation are calculated by 
the CM, which can now obtain a better average by com- 
bining samples from different connections to the same 
receiver. This is available to each TCP connection via 
cm_query(), and is useful in loss recovery. 

Data acknowledgements. On arrival of an ACK
for new data, the TCP sender calls cm_update() to in-
form the CM of a successful transmission. Duplicate ac-
knowledgements cause TCP to check its dupack count
(dup_acks). If dup_acks < 3, then TCP does noth-
ing. If dup_acks == 3, then TCP assumes a simple,
congestion-caused packet loss, and calls cm_update() to
inform the CM. TCP also enqueues a retransmission of
the lost segment and calls cm_request(). If dup_acks
> 3, TCP assumes that a segment reached the receiver
and caused this ACK to be sent. It therefore calls
cm_update(). Unlike duplicate ACKs, the expiration
of the TCP retransmission timer notifies the sender of
a more serious batch of losses, so it calls cm_update()
with the CM_LOST_FEEDBACK option set to signify
the occurrence of persistent congestion to the CM. TCP
also enqueues a retransmission of the lost segment and
calls cm_request().
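The acknowledgement handling just described amounts to a small decision table, sketched below in C. The ACT_* names and the three handler functions are hypothetical stand-ins for the real TCP input code; each returns a bit mask of the steps the text says TCP takes, rather than calling the CM itself.

```c
/* Steps TCP takes toward the CM, combined as a bit mask (illustrative names). */
enum {
    ACT_NONE           = 0,
    ACT_UPDATE_OK      = 1 << 0, /* cm_update(): a segment reached the receiver */
    ACT_UPDATE_LOSS    = 1 << 1, /* cm_update(): simple congestion-caused loss */
    ACT_UPDATE_PERSIST = 1 << 2, /* cm_update() with CM_LOST_FEEDBACK set */
    ACT_RETRANSMIT     = 1 << 3, /* enqueue retransmission of the lost segment */
    ACT_REQUEST        = 1 << 4  /* cm_request(): ask for permission to send */
};

/* ACK for new data: report a successful transmission. */
int tcp_cm_on_new_ack(void)
{
    return ACT_UPDATE_OK;
}

/* Duplicate ACK: the action depends on the running dup_acks count. */
int tcp_cm_on_dup_ack(int dup_acks)
{
    if (dup_acks < 3)
        return ACT_NONE;
    if (dup_acks == 3)
        return ACT_UPDATE_LOSS | ACT_RETRANSMIT | ACT_REQUEST;
    return ACT_UPDATE_OK; /* dup_acks > 3: another segment left the network */
}

/* Retransmission timer expiry: signal persistent congestion. */
int tcp_cm_on_timeout(void)
{
    return ACT_UPDATE_PERSIST | ACT_RETRANSMIT | ACT_REQUEST;
}
```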

TCP/CM Implementation. The integration of 
TCP and the CM required less than 100 lines of changes 
to the existing TCP code, demonstrating both the flexi- 
bility of the CM API and the low programmer overhead 
of implementing a complex protocol with the Conges- 
tion Manager. 


3.3 Congestion-controlled UDP sockets 


The CM also provides congestion-controlled UDP sock- 
ets. They provide the same functionality as standard 
Berkeley UDP sockets, but instead of immediately send-
ing the data from the kernel packet queue to lower lay-
ers for transmission, the buffered socket implementation
schedules its packet output via CM callbacks. When a 
CM UDP socket is created, it is bound to a particu-
lar flow. When data enters the packet queue, the ker-
nel calls cm_request() on the flow associated with the
socket. When the CM schedules this flow for transmis- 
sion, it calls udp_ccappsend() in the CM UDP mod- 
ule. This function transmits one MTU from the packet 
queue, and requests another callback if packets remain. 
The in-kernel implementation of the CM UDP API adds
no data copies or queue structures, and supports all
standard UDP options. Modifying existing applications 
to use this API requires only providing feedback to the 
CM, and setting a socket option on the socket. 
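The callback-driven draining of the packet queue can be modeled with a few lines of C. This is a sketch under assumptions, not the kernel module: the struct and function names are hypothetical, packets are only counted rather than stored, and the model keeps at most one request outstanding per socket, which the text does not specify.

```c
#include <stddef.h>

/* Minimal model of a congestion-controlled UDP socket (hypothetical layout). */
struct cc_udp_sock {
    size_t queued;    /* packets waiting in the kernel packet queue */
    size_t sent;      /* packets handed to the lower layers */
    int    requested; /* nonzero while a send callback is outstanding */
};

/* Stand-in for cm_request(): note that the flow wants a send callback. */
static void request_callback(struct cc_udp_sock *s)
{
    s->requested = 1;
}

/* Data enters the packet queue: ask the CM for permission to send. */
void cc_udp_enqueue(struct cc_udp_sock *s)
{
    s->queued++;
    if (!s->requested)
        request_callback(s);
}

/* Send callback from the CM: transmit one packet, re-request if more remain. */
void cc_udp_send_callback(struct cc_udp_sock *s)
{
    s->requested = 0;
    if (s->queued == 0)
        return;
    s->queued--;
    s->sent++;
    if (s->queued > 0)
        request_callback(s);
}
```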

A typical client of the CM UDP sockets will behave 
as follows, after its usual network socket initialization: 


flow = cm_open(dst, port)
setsockopt(flow, ..., CM_BUF)
loop:
    <send data on flow>
    <receive data acknowledgements>
    cm_update(flow, sent, received, ...)


3.4 Streaming Layered Audio and Video


Streaming layered audio or video applications that have 
a number of discrete rates at which they can transmit 
data are well-served by the CM rate callbacks. Instead 
of requiring a comparatively expensive notification for 
each transmission, these applications are instead noti- 
fied only in the rare event that their network condi- 
tions change significantly. Layered applications open 
their usual UDP socket, and call cm_open() to obtain 
a control socket. They operate in their own clocked 
event loop while listening for status changes on either 
their control socket or via a SIGIO signal. They use
cm_thresh() to inform the CM about network changes 
for which they should receive callbacks. 


3.5 Real-time Adaptive Applications 


Applications that desire last-minute control over their 
data transmission (i.e., those that do not want any
buffering inside the kernel) use the request callback
API provided by the CM. When given permission to
transmit via the cmapp_send() callback from the CM,
they may use cm_query() to discover the current net- 
work conditions and adapt their content based on that. 
Other servers may simply wish to send the most up- 
to-date content possible, and so will defer their data 
collection until they know they can send it. The rough
sequence of CM calls made by the application to
achieve this is:


flow = cm_open(dst)
cm_request(flow)
<receive cmapp_send() callback from libcm>
cm_query(flow, ...)
<send data>
<receive data acks>
cm_update(flow, sent, lost, ...)


Other options exist for applications that wish to ex- 
ploit the unique nature of their network utilization to 
reduce the overhead of using the services of the Conges-
tion Manager. We discuss one such option below, where
we describe how we adapted the vat interactive audio
application to use the CM.

Figure 2. The adaptive vat architecture


3.6 Interactive Real-time Audio 


The vat application provides a constant bit-rate source 
of interactive audio. Its inability to downsample its au- 
dio reduces the avenues it has available for bandwidth 
adaptation. Therefore, the best way to make vat behave 
in a network-friendly and backwards compatible man- 
ner is to preemptively drop packets to match the avail- 
able network bandwidth. There are, of course, compli- 
cations. Network applications experience two types of 
variation in available network bandwidth: long term 
variations due to changes in actual bandwidth, and 
short term variations due to the probing mechanisms 
of the congestion control algorithm. Short-term varia- 
tion is typically dealt with by buffering. Unfortunately, 
buffering, especially FIFO buffering with drop-tail be- 
havior, the de-facto standard for kernel buffers and net- 
work router buffers, can result in long delay and signif- 
icant delay variation, both of which are detrimental to 
vat’s audio quality. Vat, therefore, needs to act like an
ALF application, managing its own buffer space with 
drop-from-head behavior when the queue is full. 
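A drop-from-head buffer of the kind described here can be sketched as a small ring buffer in C. The names and the capacity are illustrative assumptions; integer packet IDs stand in for audio frames. When the queue is full, the oldest packet is discarded so that the freshest audio is kept.

```c
#include <stddef.h>

#define QCAP 4  /* small capacity chosen for illustration */

struct dfh_queue {
    int    pkt[QCAP]; /* packet IDs stand in for audio frames */
    size_t head;      /* index of the oldest queued packet */
    size_t len;       /* number of queued packets */
    size_t dropped;   /* packets discarded from the head */
};

/* Enqueue a packet; when full, discard the oldest one first. */
void dfh_push(struct dfh_queue *q, int id)
{
    if (q->len == QCAP) {
        q->head = (q->head + 1) % QCAP; /* drop from head */
        q->len--;
        q->dropped++;
    }
    q->pkt[(q->head + q->len) % QCAP] = id;
    q->len++;
}

/* Dequeue the oldest packet; returns 1 on success, 0 if the queue is empty. */
int dfh_pop(struct dfh_queue *q, int *id)
{
    if (q->len == 0)
        return 0;
    *id = q->pkt[q->head];
    q->head = (q->head + 1) % QCAP;
    q->len--;
    return 1;
}
```

With drop-tail behavior the newest packet would be rejected instead, leaving stale audio at the front of the queue, which is the delay problem the text attributes to FIFO kernel buffers.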

The resulting architecture is detailed in figure 2. 
The input audio stream is first sent to a policer, which 
provides long-term adaptation via preemptive packet 
dropping. The policer outputs into the application level 
buffer, which can be configured in various sizes and 
drop policies. This buffer feeds into the kernel buffer 
on-demand as packets are available for transmission. 


4 Evaluation 


This section describes several experiments that quantify 
the costs and benefits of our CM implementation. Our 
experiments show that using the Congestion Manager in 
the kernel has minimal costs, and that even the worst- 
case overhead of the request/callback user-space API is 
acceptably small. 

The tests were performed on the Utah Network
Testbed [22] using 600MHz Intel Pentium III proces-
sors, 128MB PC100 ECC SDRAM, and Intel EtherEx-
press Pro/100B Ethernet cards, connected via 100Mbps
Ethernet through an Intel Express 510T switch, with
Dummynet channel simulation. CM tests were run on
Linux 2.2.9, with Linux and FreeBSD clients.

Figure 3. Comparing throughput vs. loss for
TCP/CM and TCP/Linux. Rates are for a 10Mbps
link with a 60ms RTT.

To ensure the proper behavior of a flow, the con- 
gestion control algorithm must behave in a “TCP- 
compatible” [8] manner. The CM implements a TCP- 
style window-based AIMD algorithm with slow start. 
It shares bandwidth between eligible flows in a round- 
robin manner with equal weights on the flows. 

Figure 3 shows the throughput achieved by 
the Linux TCP implementation (TCP/Linux) and 
TCP with congestion control performed by the CM 
(TCP/CM). The Linux kernel against which we com-
pare has two algorithmic differences from the Conges-
tion Manager: It starts its initial window at 2 packets,
and it assumes that each ACK is for a full MTU. The 
Congestion Manager instead performs byte-counting for 
its AIMD algorithm. The first issue is Linux-specific, 
and the last is a feature of the CM. 
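To make the byte-counting distinction concrete, the following C sketch shows a window-based AIMD update driven by acknowledged bytes rather than by ACK count. It is an illustration under assumptions, not the CM's code: the struct, the function names, and the specific constants (halving on transient loss, falling back to one MTU on persistent congestion) follow conventional TCP-style behavior and are not taken from the text.

```c
#include <stdint.h>

struct cm_cwnd {
    uint32_t cwnd;     /* congestion window, in bytes (must be nonzero) */
    uint32_t ssthresh; /* slow-start threshold, in bytes */
    uint32_t mtu;      /* size of a full packet, in bytes */
};

/* Floor of half the window, never below one MTU. */
static uint32_t half_window(const struct cm_cwnd *w)
{
    uint32_t half = w->cwnd / 2;
    return half < w->mtu ? w->mtu : half;
}

/* nbytes were newly acknowledged; growth depends on bytes, not ACK count. */
void cwnd_on_ack(struct cm_cwnd *w, uint32_t nbytes)
{
    if (w->cwnd < w->ssthresh) {
        w->cwnd += nbytes; /* slow start */
    } else {
        /* additive increase: about one MTU per window of acked bytes */
        w->cwnd += (uint32_t)((uint64_t)w->mtu * nbytes / w->cwnd);
    }
}

/* Transient congestion (e.g. a single loss): multiplicative decrease. */
void cwnd_on_transient_loss(struct cm_cwnd *w)
{
    w->ssthresh = half_window(w);
    w->cwnd = w->ssthresh;
}

/* Persistent congestion (e.g. a timeout): restart from one MTU. */
void cwnd_on_persistent_loss(struct cm_cwnd *w)
{
    w->ssthresh = half_window(w);
    w->cwnd = w->mtu;
}
```

Under this scheme three ACKs covering 500 bytes each grow the window exactly as much as one ACK covering 1500 bytes, so splitting acknowledgements gains a receiver nothing.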


4.1 Kernel Overhead 


To measure the kernel overhead, we measured the 
CPU and throughput differences between the optimized 
TCP/Linux and TCP/CM. The midrange machines 
used in our test environment are sufficiently powerful 
to saturate a 100Mbps Ethernet with TCP traffic. 

There are two components to the overhead imposed
by the congestion manager: The cost of performing ac-
counting as data is exchanged on a connection, and a
one-time connection setup cost for creating CM data
structures. A microbenchmark of the connection es-
tablishment time of a TCP/CM vs. TCP/Linux indi-
cates that there is no appreciable difference in connec-
tion setup times.

We used long (megabytes to gigabytes) connections
with the ttcp utility to determine the long-term costs
imposed by the congestion manager. The impact of the
CM on extremely long term throughput was negligi-
ble: in a 1 gigabyte transfer, the congestion manager
achieved the same performance (91.6 Mbps) as native
Linux. On shorter runs, the throughput of the CM di-
verged slightly from that of Linux, but only by 0.5%.
The throughput rates are shown in figure 4. The dif-
ference is due to the CM using an initial window of 1
MTU and Linux using 2 MTU, not CPU overhead.

Figure 4. 100Mbps TCP throughput comparison.
Note that the absolute difference in the worst case be-
tween the Congestion Manager and the native TCP is
only 0.5% and that the Y axis begins at 11 megabytes
per second.

Because both implementations are able to saturate 
the network connection, we looked at the CPU uti- 
lization during these transmissions to determine the 
steady-state overhead imposed by the Congestion Man-
ager. In figure 5 we see that the CPU difference be-
tween TCP/Linux and TCP/CM converges to slightly
less than 1%. 


4.2 User-space API Overhead 


The overhead incurred by our adaptation API occurs 
primarily because the applications must process their 
ACKs in user-space instead of in the kernel. Therefore, 
these programs incur extra data copies and user/kernel 
boundary crossings. To quantify this overhead, our 
test programs sent packets of specified sizes on a UDP 
socket, and waited for acknowledgement packets from 
the server. We compare these programs to a webserver-
like TCP client which sent data to the server, and
performed a select() on its socket to determine if the
server had sent any data back. To facilitate compari-
son, we disabled delayed ACKs for the one TCP test to
ensure that our packet counts were identical.

Figure 6 shows the wall-clock time required to send
and process the acknowledgement for a packet, based on
transmitting 200,000 packets. For comparison, we in-
clude TCP statistics as well, where the TCP programs
set the maximum segment size to achieve identical net-
work performance. The “nodelay” variant is TCP with-
out delayed acks. The tests were run on a 100Mbps
network on which no losses occurred.

Figure 5. [...] TCP/Linux and TCP/CM. For long connections, the
CPU overhead converges to slightly under 1% for the
unoptimized implementation of the CM.

[Table 1 body not recovered; surviving entry: 1 recv, 2 gettimeofday]

Table 1. Cumulative sources of overhead for different
APIs using the Congestion Manager relative to sending
data with TCP.

Table 1 breaks down the sources of overhead for us- 
ing the different APIs. Using the CM with UDP re- 
quires that applications compute the round-trip-time 
(RTT) of their packets, requiring a system call to 
gettimeofday, and requires that they process their 
ACKs in user-space, requiring a system call to recv and 
the accompanying data copy into their address space. 
The ALF API further requires that the application ob- 
tain an additional control socket and select upon it, 
and that it make an explicit call to cm_request before 
transmitting data. Finally, if the kernel is unable to de- 
termine the flow to which to charge the transmission, as 
with an unconnected UDP socket, the application must 
explicitly call cm_notify().

These test cases represent the worst-case behavior
of serving a single high-bandwidth client, because no
aggregation of requests to the CM may occur. The CM
programs can achieve similar reductions in processing
time by using delayed acks, so the real API overhead
can be determined by comparing the ALF/noconnect
case to the TCP/CM case. For 168 byte packets,
ALF/noconnect results in a 25% reduction in through-
put relative to TCP without delayed ACKs.

Figure 6. API throughput comparison on a 100Mbps
link. The worst-case throughput reduction incurred
by the CM is 25% from TCP/CM nodelay to
ALF/noconnect.


4.3 Benefits of Sharing


One benefit of integrating congestion information with 
the CM is immediately clear. A client that sequen- 
tially fetches files from a webserver with a new TCP 
connection each time loses its prior congestion infor- 
mation, but with concurrent connections with the CM, 
the server is able to use this information to start subse- 
quent connections with more accurate congestion win- 
dows. Figure 7 shows a test we performed across the 
vBNS between MIT and the University of Utah, where 
an unmodified (non-CM) client performed 9 retrievals 
of the same 128k file with a 500ms delay between re- 
trievals, resulting in a 40% improvement in the transfer 
time for the later requests. (Other file sizes and delays 
yield similar results, so long as they overlap. The ben- 
efits are comparatively greater for smaller files). The 
CM requires an additional RTT (75ms) for the first
transfer, because Linux sets its initial congestion win- 
dow to 2 MTUs instead of 1. This pattern of mul- 
tiple connections is still quite common in webservers 
despite the adoption of persistent connections: Many 
browsers open 4 concurrent connections to a server, and 
many client/server combinations do not support persis- 
tent connections. Persistent connections [28] provide 
similar performance benefits, but suffer from their own 
drawbacks, which we discuss in section 6. 


4.4 Adaptive Applications 


In this section, we demonstrate some of the network 
adaptive behaviors enabled by the CM.

Figure 7. Sharing TCP state: The client requests
the same file 9 times with a 500ms delay between re- 
quest initiations. By sharing congestion information 
and avoiding slow-start, the CM-enabled server is able 
to provide faster service for subsequent requests, despite 
a smaller initial congestion window. 


As noted earlier, applications that require tight 
control over data scheduling use the request/callback 
(ALF) API, and are notified by the CM as soon as they 
can transmit data. The behavior of an adaptive layer- 
ing application run across the vBNS using this API is 
shown in figure 8. This application chooses a layer to 
transmit based upon the current rate, but sends pack- 
ets as rapidly as possible to allow its client to buffer 
more data. We see that the CM is able to provide suffi- 
cient information to the application to allow it to adapt 
properly to the network conditions. 

For self-clocked applications that base their trans- 
mitted data upon the bandwidth to the client (such as 
conventional layered audio servers), the CM rate call- 
back mechanism provides a low-overhead mechanism for 
adaptation, and allows clients to specify thresholds
for the notification callbacks. Figure 9 shows appli- 
cation adaptation using rate callbacks for a connection 
between MIT and the University of Utah. Here, the ap- 
plication decides which of the four layers it should send 
based on notifications from the CM about rate changes. 

From figures 8 and 9, we see from the increased oscil- 
lation rate in the transmitted layer that the ALF appli- 
cation is more responsive to smaller changes in available 
bandwidth, whereas the rate callback application relies 
occasionally on short-term kernel buffering for smooth- 
ing. There is an overhead vs. functionality trade-off 
in the decision of which API to use, given the higher 
overhead of the ALF API, but applications face a more 
important decision about the behavior they desire. 

Figure 8. Bandwidth perceived by an adaptive layered
application using the request callback (ALF) API.

Figure 9. Bandwidth perceived by an adaptive layered
application using the rate callback API.

Some applications may be concerned about the over-
head from receiver feedback. To mitigate this, an ap-
plication may delay sending feedback; we see this in a
minor and inflexible way with TCP delayed acks. In
figure 10, we see that delaying feedback to the CM
causes burstiness in the reported bandwidth. Here, the
feedback by the receiver was delayed by min(500 acks,
2000ms). The initial slow start is delayed by 2s wait-
ing for the application, then the update causes a large
rate change. Once the pipe is sufficiently full, 500 acks
come relatively rapidly, and the normal, though bursty,
non-timeout behavior resumes.
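The min(500 acks, 2000ms) feedback policy can be sketched as a small batching routine in C. The names are hypothetical and the model is simplified: the time limit is checked only when an ack arrives, so no separate timer is modeled. The caller would pass the returned count to the CM in a single update.

```c
#include <stdint.h>

#define FB_MAX_ACKS 500   /* report after this many acks ... */
#define FB_MAX_MS   2000  /* ... or after this much time, whichever is first */

struct fb_batch {
    uint32_t acks;    /* acks accumulated since the last report */
    uint64_t last_ms; /* time of the last report, in milliseconds */
};

/*
 * Record one ack arriving at time now_ms. Returns the number of acks to
 * report to the CM now, or 0 if the batch should keep accumulating.
 */
uint32_t fb_on_ack(struct fb_batch *b, uint64_t now_ms)
{
    uint32_t n;

    b->acks++;
    if (b->acks < FB_MAX_ACKS && now_ms - b->last_ms < FB_MAX_MS)
        return 0;

    n = b->acks;
    b->acks = 0;
    b->last_ms = now_ms;
    return n;
}
```

Early in a connection few acks arrive, so the time limit dominates and the first report is late; once the pipe is full the ack limit dominates, which matches the bursty but regular behavior described above.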


5 Discussion 


We have shown several benefits of integrated flow man- 
agement and the adaptation API, and have explored the 
design features that make the API easy to use. This sec- 
tion describes an optimization useful for busy servers, 
and discusses some drawbacks and limitations of the 
current CM architecture. 


Figure 10. Adaptive layered application using rate
callback API with delayed feedback (min(500 packets, 2s)).
Axes: Rate (in KBps) vs. Time (in sec). Curves: Transmission
Rate and Rate reported by CM.


Optimizations. Servers with large numbers of con- 
current clients are often very sensitive to the overhead 
caused by multiple kernel boundary crossings. To re- 
duce this overhead, we can batch several sockets into 
the same cm_request call with the cm_bulk_request 
call, and likewise for query, notify, and update calls. 

By multiplexing control information for many sock- 
ets on each CM call, the overhead from kernel crossings 
is mitigated at the expense of managing more compli- 
cated data structures for the CM interface. Bulk query- 
ing is already performed in libcm when multiple flows 
are ready during a single ioctl to determine which 
flows can send data, but this completes the interface. 

Trust issues. Because our goal was an architec- 
ture that did not require modifications to receivers, we 
devised a system where applications provide feedback 
using the cm_update() call. The consequence of this is
that there is a potential for misuse, due to bugs or mal- 
ice. For example, the CM client could repeatedly mis- 
inform the CM about the absence of congestion along 
a path and obtain higher bandwidth. This does not 
increase the vulnerability of the Internet to such prob- 
lems, because such abuse is already trivial. More im- 
portant are situations where users on the same machine 
could potentially interfere with each other. To prevent 
this, the Congestion Manager would need to ensure that 
only kernel-mediated (e.g. TCP) flows belonging to dif- 
ferent users can belong in the same macroflow. Our 
current implementation does not make an attempt to 
provide this protection. Savage [37] presents several 
methods by which a malicious receiver can defeat con- 
gestion control. The solutions he proposes can be easily 
used with the CM; we have already implemented byte- 
counting to prevent ACK division. 

Macroflow construction. When differentiated
services, or any system which provides different service
to flows between the same pair of hosts, start being de-
ployed, the CM would have to reconsider the default
choice of a macroflow. We expect to be able to gain
some benefit by including the IP differentiated-services 
field in deciding the composition of a macroflow. 

Finally, we observe that remote LANs are not often
the bottleneck for an outside communicator. As sug-
gested in [42, 36] among others, aggregating congestion
information about remote sites with a shared bottleneck 
and sharing this information with local peers may bene- 
fit both users and the network itself. A macroflow may 
thus be extended to cover multiple destination hosts 
behind the same shared bottleneck link. Efficiently de- 
termining such bottlenecks remains an open research 
problem. 

Limitations. The current CM architecture is de- 
signed only to handle unicast flows. The problem of 
congestion control for multicast flows is a much more 
difficult problem which we deliberately avoid. UDP ap-
plications using the CM are required to perform their
own loss detection, which may add application com-
plexity. Implementing the Congestion Manager protocol
discussed in [3] would eliminate this need,
but remains to be studied.


6 Related work 


Designing adaptive network applications has been an 
active area of research for the past several years. In 
1990, Clark and Tennenhouse [11] advocated the use 
of application-level framing (ALF) for designing net- 
work protocols, where protocol data units are chosen 
in concert with the application. Using this approach, 
an application can have a greater influence over decid- 
ing how loss recovery occurs than in the traditional lay- 
ered approach. The ALF philosophy has been used with 
great benefit in the design of several multicast transport 
protocols including the Real-time Transport Protocol 
(RTP) [38], frameworks for reliable multicast [14, 33], 
and Internet video [24, 35].

Adaptation APIs in the context of mobile informa- 
tion access were explored in the Odyssey system [26]. 
Implemented as a user-level module in the NetBSD op- 
erating system, Odyssey provides API calls by which 
applications can manage system resources, with upcalls 
to applications informing them when changes occur in 
the resources that are available. In contrast, our CM 
system is implemented in-kernel since it has to manage 
and share resources across applications (e.g., TCP) that 
are already in-kernel. This necessitates a different ap- 
proach to handling application callbacks. In addition, 
the CM approach to measuring bandwidth and other 
network conditions is tied to the congestion avoidance 
and control algorithms, as compared to the instrumen- 
tation of the user-level RPC mechanism in Odyssey. 
We believe that our approach to providing adaptation 
information for bandwidth, round-trip time, and loss 
rate complements Odyssey’s management of disk space, 
CPU, and battery power.

The CM system uses application callbacks or upcalls
as an abstraction, an old idea in operating systems.
Clark describes upcalls in the Swift operating system, 
where the motivation is a lower layer of a protocol stack 
synchronously invoking a higher-layer function across a 
protection boundary [9]. The Mach system used the no- 
tion of ports, a generic communication abstraction for 
fast inter-process communication (IPC). POSIX speci- 
fies a standard way of passing “soft real-time signals” 
that can be used to send a notification to a user-level 
process, but it restricts the amount of data that can be 
communicated to a 32-bit quantity. 

Event delivery abstractions for mobile computing 
have been explored in [1], where “monitored” events 
are tracked using polling and “triggered” events (e.g., 
PC card insertion) are notified using IPC. This work 
defines a language-level mechanism based on C++ ob- 
jects for event registration, delivery, and handling. This 
system is implemented in Mach using ports for IPC. 

Our approach is to use a select() call on a con- 
trol socket to communicate information between kernel 
and user-level. The recent work of Banga et al. [4] to 
improve the performance of this type of event delivery 
can be used to further improve our performance. 
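
The general pattern of waiting on a control socket with select() can be
sketched as follows. This is only an illustration of the mechanism: a
socketpair stands in for the kernel side, and the function name and
notification payload are hypothetical rather than taken from the CM
implementation.

```python
import select
import socket


def wait_for_notification(control_sock, timeout_s=1.0):
    """Block in select() until the control socket is readable, then read it.

    Returns the notification bytes, or None if the timeout expires.
    """
    readable, _, _ = select.select([control_sock], [], [], timeout_s)
    if control_sock in readable:
        return control_sock.recv(64)
    return None


# A socketpair stands in for the kernel side; this is purely illustrative.
kernel_side, app_side = socket.socketpair()
nothing_yet = wait_for_notification(app_side, timeout_s=0.05)
kernel_side.sendall(b"rate-change")  # hypothetical notification payload
event = wait_for_notification(app_side)
kernel_side.close()
app_side.close()
```

The application can add the same control socket to the descriptor set it
already passes to select() for its data sockets, so no extra thread or
signal handler is needed.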

The Microsoft Winsock implementation is largely 
callback-based, but here callbacks are implemented as 
conventional function calls since Winsock is a user-level 
library within the same protection boundary as the ap- 
plication [31]. The main reason we did not implement 
the CM as a user-level daemon was because TCP is al- 
ready implemented in-kernel in most UNIX operating 
systems, and it is important to share network informa- 
tion across TCP flows. 

Quality-of-service (QoS) interfaces have been ex- 
plored in several operating systems, including Neme- 
sis [16]. Like the exokernel approach [20] and SPIN [7], 
Nemesis enables applications to perform as much of the 
processing as possible on their own using application- 
specific policy, supported by a set of operating system 
abstractions different from those in UNIX. Whereas 
Nemesis treats local network-interface bandwidth as the 
resource to be managed, we take a more end-to-end ap- 
proach of discovering the end-to-end performance to dif- 
ferent end-hosts, enabling sharing across common net- 
work paths. Furthermore, the API exported by Nemesis 
is useful for applications that can make resource reser- 
vations, while the CM API provides information about 
network conditions. Some “web switches” [?] provide 
traffic shaping and QoS based upon application infor- 
mation, but do not provide integrated flow management 
or feedback to the applications creating the data. 

Multiple concurrent streams can cause problems for 
TCP congestion control. First, the ensemble of flows 
probes more aggressively for bandwidth than a single 
flow. Second, upon experiencing congestion along the 
path, only a subset of the connections usually reduce 
their window. Third, these flows do not share any information
between each other. While we propose a general
solution to these problems, application-specific solutions
have been proposed in the literature. Of particular
importance are approaches that multiplex several
logically distinct streams onto a single TCP connection 
at the application level, including Persistent-connection 
HTTP (P-HTTP [28], part of HTTP/1.1 [12]), the Session
Control Protocol (SCP) [39], and the MUX protocol
[15]. Unfortunately, these solutions suffer from
two important drawbacks. First, because they are 
application-specific, they require each class of applica- 
tions (Web, real-time streams, file transfers, etc.) to re- 
implement much of the same machinery. Second, they 
cause an undesirable coupling between logically differ- 
ent streams: if packets belonging to one stream are lost, 
another stream could stall even if none of its packets 
are lost, because of the in-order “linear” delivery forced
by TCP. Independent data units belonging to different 
streams are no longer independently processible and the 
parallelism of downloads is often lost. 
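
The coupling described above can be seen in a small model. The sketch
below is a simplification written for this illustration (segment tuples,
function names, and the loss model are assumptions): it compares what a
receiver can deliver when one segment is missing on a single shared
in-order connection with what it could deliver if each stream were
ordered independently.

```python
def delivered_in_order(segments, lost_seq):
    """Return the segments a receiver can hand to the application when the
    segment with sequence number lost_seq has not arrived yet.

    segments: list of (seq, stream_id) sent on ONE shared connection.
    In-order delivery means nothing at or after the gap is delivered.
    """
    return [(seq, stream) for seq, stream in sorted(segments) if seq < lost_seq]


def delivered_independently(segments, lost_seq):
    """Same loss, but each stream is ordered on its own: only the stream
    that owns the lost segment waits behind the gap."""
    lost_stream = dict(segments)[lost_seq]
    return [
        (seq, stream)
        for seq, stream in sorted(segments)
        if not (stream == lost_stream and seq >= lost_seq)
    ]


# Two logical streams, A and B, interleaved on one connection.
segments = [(0, "A"), (1, "B"), (2, "A"), (3, "B"), (4, "B")]
shared = delivered_in_order(segments, lost_seq=2)        # the loss hits stream A
separate = delivered_independently(segments, lost_seq=2)
```

In the shared case, segments 3 and 4 of stream B are held back even
though none of B's data was lost.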


7 Conclusion 


The CM system enables applications to obtain an un- 
precedented degree of control over what they can do 
in response to different network conditions. It incorpo- 
rates robust congestion control algorithms, freeing each 
application from having to re-implement them. It ex- 
poses a rich API that allows applications to adapt their 
transmissions at a fine-grained level, and allows the ker- 
nel and applications to integrate congestion information 
across flows. 

Our evaluation of the CM implementation shows 
that the callback interface is effective for a variety of ap- 
plications, and does not unduly burden the programmer 
with restrictive interfaces. From a performance stand- 
point, the CM itself imposes very little overhead; that 
which remains is mostly due to the unoptimized nature 
of our implementation. The architecture of programs 
implemented using UDP imposes some additional over- 
head, but the cost of using the CM after this architec- 
tural conversion is quite small. 

Many systems exist to deliver content over the In- 
ternet using TCP or home-grown UDP protocols. We 
believe that by providing an accessible, robust frame- 
work for congestion control and adaptation, the Con- 
gestion Manager can help improve both the implemen- 
tation and performance of these systems. 

The Congestion Manager implementation for Linux
is available from our web page,
http://nms.lcs.mit.edu/projects/cm/.


Abstract


MEMS-based storage devices promise significant 
performance, reliability, and power improvements 
relative to disk drives. This paper compares and 
contrasts these two storage technologies and ex- 
plores how the physical characteristics of MEMS- 
based storage devices change four aspects of oper- 
ating system (OS) management: request schedul- 
ing, data placement, failure management, and power 
conservation. Straightforward adaptations of exist- 
ing disk request scheduling algorithms are found to 
be appropriate for MEMS-based storage devices. A 
new bipartite data placement scheme is shown to 
better match these devices’ novel mechanical posi- 
tioning characteristics. With aggressive internal re- 
dundancy, MEMS-based storage devices can mask 
and tolerate failure modes that halt operation or 
cause data loss for disks. In addition, MEMS-based 
storage devices simplify power management because 
the devices can be stopped and started rapidly. 


1 Introduction 


Decades of research and experience have provided 
operating system builders with a healthy under- 
standing of how to manage disk drives and their 
role in storage systems. This management includes 
achieving acceptable performance despite relatively 
time-consuming mechanical positioning delays, deal- 
ing with transient and permanent hardware prob- 
lems so as to achieve high degrees of data surviv- 
ability and availability, and minimizing power dissi- 
pation in battery-powered mobile environments. To 
address these issues, a wide array of OS techniques 
are used, including request scheduling, data layout, 
prefetching, caching, block remapping, data replica- 
tion, and device spin-down. Given the prevalence 
and complexity of disks, most of these techniques 
have been specifically tuned to their physical char- 
acteristics. 


When other devices (e.g., magnetic tape or Flash 
RAM) are used in place of disks, the characteristics 
of the problems change. Putting new devices behind
a disk-like interface is generally sufficient to achieve
a working system. However, OS management tech- 
niques must be tuned to a particular device’s char- 
acteristics to achieve the best performance, reliabil- 
ity, and lifetimes. For example, request schedul- 
ing techniques are much less important for RAM- 
based storage devices than for disks, since location- 
dependent mechanical delays are not involved. Like- 
wise, locality-enhancing block layouts such as cylin- 
der groups [18], extents [19], and log-structuring [24] 
are not as beneficial. However, log-structured file 
systems with idle-time cleaning can increase both 
performance and device lifetimes of Flash RAM stor- 
age devices with large erasure units [5]. 


Microelectromechanical systems (MEMS)-based 
storage is an exciting new technology that will 
soon be available in systems. MEMS are very 
small scale mechanical structures (on the order of
10-1000 µm) fabricated on the surface of silicon
wafers [33]. These microstructures are created using 
photolithographic processes much like those used 
to manufacture other semiconductor devices (e.g., 
processors and memory) [7]. MEMS structures can 
be made to slide, bend, and deflect in response 
to electrostatic or electromagnetic forces from 
nearby actuators or from external forces in the 
environment. Using minute MEMS read/write 
heads, data bits can be stored in and retrieved 
from media coated on a small movable media 
sled [1, 11, 31]. Practical MEMS-based storage
devices are the goal of major efforts at many 
research centers, including IBM Zurich Research 
Laboratory [31], Carnegie Mellon University [1], 
and Hewlett-Packard Laboratories [13]. 


Like disks, MEMS-based storage devices have unique 
mechanical and magnetic characteristics that merit 
specific OS techniques to manage performance, fault 
tolerance, and power consumption. For example, 
the mechanical positioning delays for MEMS-based 
storage devices depend on the initial and destina- 
tion position and velocity of the media sled, just as 
disks’ positioning times are dependent on the arm 
position and platter rotational offset. However, the 
mechanical expressions that characterize sled motion
differ from those describing disk platter and arm mo- 
tion. These differences impact both request schedul- 
ing and data layout trade-offs. Similar examples ex- 
ist for failure management and power conservation 
mechanisms. To assist designers of both MEMS- 
based storage devices and the systems that use them, 
an understanding of the options and trade-offs for 
OS management of these devices must be developed. 


This paper takes a first step towards developing this 
understanding of OS management techniques for 
MEMS-based storage devices. It focuses on devices 
with the movable sled design that is being devel- 
oped independently by several groups. With higher 
storage densities (260-720 Gbit/in²) and lower random
access times (less than 1 ms) than disks, these
devices could play a significant role in future sys- 
tems. After describing a disk-like view of these de- 
vices, we compare and contrast their characteristics 
with those of disks. Building on these comparisons, 
we explore options and implications for three major 
OS management issues: performance (specifically, 
request scheduling and block layout), failure man- 
agement (media defects, device failures, and host 
crashes), and power conservation. 


While these explorations are unlikely to represent 
the final word for OS management of these emerg- 
ing devices, we believe that several of our high-level 
results will remain valid: 


• Disk scheduling algorithms can be easily
adapted to MEMS-based storage devices, improving
performance much like they do for
disks.

• Disk layout techniques can be adapted usefully,
but the Cartesian movement of the sled (instead
of the rotational motion of disks) allows further
refinement of layouts.

• Striping of data and error correcting codes
(ECC) across tips can greatly increase a device’s
tolerance to media, tip, and electronics faults;
in fact, many faults that would halt operation
or cause data loss in disks can be masked and
tolerated in MEMS-based storage devices.

• OS power conservation is much simpler for
MEMS-based storage devices. In particular,
miniaturization and lack of rotation enable
rapid transition between power-save and active
modes, obviating the need for complex idle-time
prediction and maximization algorithms.


The remainder of this paper is organized as follows.
Section 2 describes MEMS-based storage devices, fo- 
cusing on how they are similar to and different from 
magnetic disks. Section 3 describes our experimen- 
tal setup, including the simulator and the workloads 
used. Section 4 evaluates request scheduling algo- 
rithms for MEMS-based storage devices. Section 5 
explores data layout options. Section 6 describes 
approaches to fault management within and among 
MEMS-based storage devices. Section 7 discusses 
other device characteristics impacting the OS. Sec- 
tion 8 summarizes this paper’s contributions. 


2 MEMS-based storage devices 


This section describes a MEMS-based storage device 
and compares and contrasts its characteristics with 
those of conventional disk drives. The description, 
which follows that of Reference [11], maps these devices
onto a disk-like metaphor appropriate to their
physical and operational characteristics. 


2.1 Basic device description 


MEMS-based storage devices can use the same ba- 
sic magnetic recording technologies as disks to read 
and write data on the media. However, because it 
is difficult to build reliable rotating components in 
silicon, MEMS-based storage devices are unlikely to 
use rotating platters. Instead, most current designs 
contain a movable sled coated with magnetic me- 
dia. This media sled is spring-mounted above a two- 
dimensional array of fixed read/write heads (probe 
tips) and can be pulled in the X and Y dimensions 
by electrostatic actuators along each edge. To ac- 
cess data, the media sled is first pulled to a specific 
location (x,y displacement). When this seek is com- 
plete, the sled moves in the Y dimension while the 
probe tips access the media. Note that the probe 
tips remain stationary—except for minute X and 
Z dimension movements to adjust for surface vari- 
ation and skewed tracks—while the sled moves. In 
contrast, rotating platters and actuated read/write 
heads share the task of positioning in disks. Fig- 
ures 1 and 2 illustrate this MEMS-based storage de- 
vice design. 


As a concrete example, the footprint of one MEMS-based
storage device design is 196 mm², with 64 mm²
of usable media area and 6400 probe tips [1]. Dividing
the media into bit cells of 40×40 nm, and accounting
for an ECC and encoding overhead of 2
bits per byte, this design has a formatted capacity
of 3.2 GB/device. Note the square nature of the
bit cells, which is not the case in conventional disk
drives. With minute probe tips and vertical recording,
bits stored on these devices can have a 1-to-1
aspect ratio, resulting in areal densities 15-30 times
greater than those of disks. However, per-device capacities
are lower because individual MEMS-based
storage devices are much smaller than disks. Because
the mechanically-positioned MEMS components
have much smaller masses than corresponding
disk parts, their random access times are in the
hundreds of microseconds. For the default device
parameters in this paper, the average random 4 KB
access time is 703 µs.


Figure 1: Components of a MEMS-based storage
device. The media sled is suspended above an array of
probe tips. The sled moves small distances along the X
and Y axes, allowing the stationary tips to address the
media.


2.2 Low-level data layout


The storage media on the sled is divided into rect- 
angular regions as shown in Figure 3. Each region 
contains MxN bits (e.g., 2500x2500) and is accessi- 
ble by exactly one probe tip; the number of regions 
on the media equals the number of probe tips. Each 
term in the nomenclature below is defined both in 
the text and visually in Figure 4. 


Cylinders. Drawing on the analogy to disk termi- 
nology, we refer to a cylinder as the set of all bits 
with identical x offset within a region (i.e., at identical
sled displacement in X). In other words, a cylinder
consists of all bits accessible by all tips when the
sled moves only in the Y dimension, remaining im- 
mobile in the X dimension. Cylinder 1 is highlighted 
in Figure 4 as the four circled columns of bits. This 
definition parallels that of disk cylinders, which con- 
sist of all bits accessible by all heads while the arm 
remains immobile. There are M cylinders per sled. 
In our default model, each sled has 2500 cylinders 
that each hold 1350 KB of data. 


Tracks. A MEMS-based storage device might have
6400 tips underneath its media sled; however, due
to power and heat considerations it is unlikely that
all 6400 tips can be active (accessing data) concurrently.
We expect to be able to activate 200-2000
tips at a time. To account for this limitation, we
divide cylinders into tracks. A track consists of all
bits within a cylinder that can be read by a group
of concurrently active tips. The sled in Figure 4 has
sixteen tips (one per region; not all tips are shown),
of which up to four can be concurrently active;
each cylinder therefore has four tracks. Track 0 of
cylinder 1 is highlighted in the figure as the leftmost
circled column of bits. Note again the parallel with
disks, where a track consists of all bits within a cylinder
accessible by a single active head. In our default
model, each sled has 6400 tips and 1280 concurrently
active tips, so each cylinder contains 5 tracks that
each hold 270 KB of data. Excluding positioning
time, accessing an entire track takes 3.47 ms.


Figure 2: The movable media sled. The actuators,
spring suspension, and the media sled are shown. Anchored
regions are solid and the movable structure is
shaded grey.


Sectors. Continuing the disk analogy, tracks are
divided into sectors. Instead of having each active
tip read or write an entire vertical column of N bits,
each tip accesses only 90 bits at a time: 10 bits of
servo/tracking information and 80 data bits (8 encoded
data bytes). Each 80-data-bit group forms
an 8-byte sector, which is the smallest individually
accessible unit of data on our MEMS-based storage
device. Each track in Figure 4 contains 12 sectors
(3 per tip). These sectors parallel the partitioning
of disk tracks into sectors, with three notable differences.
First, disk sectors contain more data (e.g.,
512 bytes vs. 8 bytes). Second, MEMS-based storage
devices can access multiple sectors concurrently:
Figure 4 shows the four active tips accessing sectors
4, 5, 6, and 7. Third, MEMS-based storage devices
can support bidirectional access, meaning that
a data sector can be accessed in either the +Y or
−Y direction. In our default model, each track is
composed of 34,560 sectors of 8 bytes each, of which
up to 1280 sectors can be accessed concurrently. Excluding
positioning time, each 1280-sector (10 KB)
access takes 0.129 ms.


Figure 3: Data organization on MEMS-based storage devices. The illustration depicts a small portion of the
magnetic media sled. Each small rectangle outlines the media area accessible by a single probe tip, with a total of 16
tip regions shown. A full device contains thousands of tips and tip regions. Each region stores MxN bits, organized
into M vertical columns of N bits, alternating between servo/tracking information (10 bits) and data (80 bits = 8
encoded data bytes). To read or write data, the media sled passes over the tips in the ±Y directions while the tips
access the media.


Figure 4: Cylinders, tracks, sectors, and logical blocks. This example shows a MEMS-based storage device with
16 tips and MxN = 3x280. A “cylinder” is defined as all data at the same x offset within all regions; cylinder 1 is
indicated by the four circled columns of bits. Each cylinder is divided into 4 “tracks” of 1080 bits, where each track is
composed of four tips accessing 280 bits each. Each track is divided into 12 “sectors” of 80 bits each, with 10 bits of
servo/tracking information between adjacent sectors and at the top and bottom of each track. (There are nine sectors
in each tip region in this example.) Finally, sectors are grouped together in pairs to form “logical blocks” of 16 bytes
each. Sequential sector and logical block numbering are shown on the right. These definitions are discussed in detail
in Section 2.2.


Logical blocks. For the experiments in this paper, 
we combine groups of 64 sectors into SCSI-like logical 
blocks of 512 bytes each. Each logical block is there- 
fore striped across 64 tips, and up to 20 logical blocks 
can be accessed concurrently (1280 ÷ 64 = 20). During
a request, only those logical blocks needed to satisfy
the request and any firmware-directed prefetching
are accessed; unused tips remain inactive to conserve
power.
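

The layout figures quoted in this section can be cross-checked with a few
lines of arithmetic. The sketch below is not taken from the simulator; it
assumes that whole 90-bit units (10 servo bits plus 80 data bits) are
packed into each 2500-bit column with the remainder unused, and that KB
means 1024 bytes. Under those assumptions it reproduces the track,
cylinder, and logical-block numbers given above.

```python
# Default device parameters quoted in the text.
TIPS = 6400               # probe tips (one region each)
ACTIVE_TIPS = 1280        # tips that can be active concurrently
CYLINDERS = 2500          # M: bit columns per region
BITS_PER_COLUMN = 2500    # N: bits in one column of a region
SERVO_BITS = 10           # servo/tracking bits accompanying each sector
DATA_BITS = 80            # encoded bits per sector
SECTOR_BYTES = 8          # data bytes per sector
BLOCK_BYTES = 512         # logical block size

# Assumption: whole (servo + data) units are packed into a column and the
# remainder is unused; this choice reproduces the figures in the text.
sectors_per_column = BITS_PER_COLUMN // (SERVO_BITS + DATA_BITS)

tracks_per_cylinder = TIPS // ACTIVE_TIPS
sectors_per_track = ACTIVE_TIPS * sectors_per_column
track_kb = sectors_per_track * SECTOR_BYTES / 1024
cylinder_kb = track_kb * tracks_per_cylinder
device_bytes = CYLINDERS * TIPS * sectors_per_column * SECTOR_BYTES

tips_per_block = BLOCK_BYTES // SECTOR_BYTES
concurrent_blocks = ACTIVE_TIPS // tips_per_block
```

The resulting device capacity is 3,456,000,000 bytes, or about 3.2 GB
when a GB is taken as 2^30 bytes.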


2.3 Media access characteristics


Media access requires constant sled velocity in the Y 
dimension and zero velocity in the X dimension. The 
Y dimension access speed is a design parameter and 
is determined by the per-tip read and write rates, the 
bit cell width, and the sled actuator force. Although 
read and write data rates could differ, tractable con- 
trol logic is expected to dictate a single access veloc- 
ity in early MEMS-based storage devices. In our 
default model, the access speed is 28 mm/s and the
corresponding per-tip data rate is 0.7 Mbit/s. 
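

The relationship between access speed, bit cell width, and data rate can
be worked through numerically. The sketch below is an illustrative
calculation, not simulator code; it assumes 27 sectors per tip column
(the largest number of 90-bit units that fit in 2500 bits) and takes MB
as 2^20 bytes. It reproduces the 0.7 Mbit/s per-tip rate, the 0.129 ms
sector access, the 3.47 ms track access, and the 76 MB/s streaming
bandwidth used in Section 2.4.

```python
ACCESS_SPEED_M_PER_S = 28e-3   # sled velocity in Y during access
BIT_CELL_M = 40e-9             # bit cell width (40 nm)
BITS_PER_SECTOR_UNIT = 90      # 10 servo bits + 80 encoded data bits
SECTORS_PER_COLUMN = 27        # assumption: floor(2500 / 90)
ACTIVE_TIPS = 1280
SECTOR_BYTES = 8

# One bit passes under a tip each time the sled moves one bit cell.
per_tip_bits_per_s = ACCESS_SPEED_M_PER_S / BIT_CELL_M

# Time for every active tip to read one sector (servo + data bits).
sector_time_s = BITS_PER_SECTOR_UNIT / per_tip_bits_per_s

# Time to sweep a full track, excluding positioning.
track_time_s = SECTORS_PER_COLUMN * sector_time_s

# All active tips deliver one 8-byte sector per sector time.
stream_bytes_per_s = ACTIVE_TIPS * SECTOR_BYTES / sector_time_s
stream_mb_per_s = stream_bytes_per_s / 2**20
```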


Positioning the sled for read or write involves sev- 
eral mechanical and electrical actions. To seek to a 
sector, the appropriate probe tips must be activated 
(to access the servo information and then the data), 
the sled must be positioned at the correct x,y displacement,
and the sled must be moving at the correct
velocity for access. Whenever the sled seeks in
the X dimension—i.e., the destination cylinder dif- 
fers from the starting cylinder—extra settling time 
must be taken into account because the spring-sled 
system oscillates in X after each cylinder-to-cylinder 
seek. Because this oscillation is large enough to 
cause off-track interference, a closed loop settling 
phase is used to damp the oscillation. To the first 
order, this active damping is expected to require a 
constant amount of time. Although slightly longer 
settling times may ultimately be needed for writes, 
as is the case with disks, we currently assume that 
the settling time is the same for both read and write 
requests. Settling time is not a factor in Y dimen- 
sion seeks because the oscillations in Y are subsumed 
by the large Y dimension access velocity and can be 
tolerated by the read/write channel. 


As the sled is moved away from zero displacement, 
the springs apply a restoring force toward the sled’s 
rest position. These spring forces can either improve 
or degrade positioning time (by affecting the effective
actuator force), depending on the sled displacement
and direction of motion. This force is parameterized
in our simulator by the spring factor, the
ratio of the maximum spring force to the maximum 
actuator force. A spring factor of 75% means that 
the springs pull toward the center with 75% of the 
maximum actuator force when the sled is at full dis- 
placement. The spring force decreases linearly to 0% 
as sled displacement approaches zero. The spring 
restoring force makes the acceleration of the sled 
a function of instantaneous sled position. In gen- 
eral, the spring forces tend to degrade the seek time 
of short seeks and improve the seek time of long 
seeks [11]. 
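

A minimal sketch of the spring factor's effect on the net force follows.
It is a simplification written for illustration, not the simulator's
model: it assumes the actuators apply their full force in the direction
of travel and expresses both displacement and force as fractions of their
maxima. The function name and sign convention are assumptions.

```python
def net_force_fraction(displacement, moving_outward, spring_factor=0.75):
    """Net accelerating force as a fraction of the maximum actuator force.

    displacement: sled offset from rest as a fraction of full displacement,
                  in [0, 1].
    moving_outward: True when the actuators push the sled away from center,
                    so the springs oppose the motion.
    """
    if not 0.0 <= displacement <= 1.0:
        raise ValueError("displacement must be in [0, 1]")
    spring = spring_factor * displacement  # spring force is linear in displacement
    return 1.0 - spring if moving_outward else 1.0 + spring
```

With a 75% spring factor, a sled at full displacement accelerates
outward with only 25% of the actuator force but inward with 175% of it,
which is why acceleration depends on the instantaneous sled position.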


Large transfers may require that data from multi- 
ple tracks or cylinders be accessed. To switch tracks 
during large transfers, the sled switches which tips 
are active and performs a turnaround, using the 
actuators to reverse the sled’s velocity (e.g., from 
+28 mm/s to −28 mm/s). The turnaround time is
expected to dominate any additional activity, such 
as the time to activate the next set of active tips, 
during both track and cylinder switches. One or two 
turnarounds are necessary for any seek in which the 
sled is moving in the wrong direction—away from 
the sector to be accessed—before or after the seek. 


2.4 Comparison to conventional disks 


Although MEMS-based storage devices involve some 
radically different technologies from disks, they 
share enough fundamental similarity for a disk-like 
model to be a sensible starting point. Like disks, 
MEMS-based storage devices stream data at a high 
rate and suffer a substantial distance-dependent po- 
sitioning time delay before each nonsequential ac- 
cess. In fact, although MEMS-based storage devices 
are much faster, they have ratios of request through- 
put to data bandwidth similar to those of disks from 
the early 1990s. Some values of the ratio, γ, of request
service rate (IO/s) to streaming bandwidth
(MB/s) for some recent disks include γ = 26 (1989)
for the CDC Wren-IV [21], γ = 17 (1993) [12], and
γ = 5.2 (1999) for the Quantum Atlas 10K [22]. γ for
disks continues to drop over time as bandwidth improves
faster than mechanical positioning times. In
comparison, the MEMS-based storage device in this
paper yields γ = 19 (1422 IO/s ÷ 76 MB/s), comparable
to disks within the last decade. Also, although
many probe tips access the media in parallel, they
are all limited to accessing the same relative x,y offset
within a region at any given point in time; recall
that the media sled moves freely while the probe tips 
remain relatively fixed. Thus, the probe tip parallelism
provides greater data rates but not concurrent,
independent accesses. There are alternative
physical device designs that would support greater 
access concurrency and lower positioning times, but 
at substantial cost in capacity [11]. 
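

The ratio of request service rate to streaming bandwidth for the MEMS
device can be recomputed from numbers given earlier. The sketch below
assumes that the request rate is simply the reciprocal of the 703 µs
average random 4 KB access time from Section 2.1; the text does not state
this derivation, but the figures agree.

```python
def gamma(io_per_s, mb_per_s):
    """Ratio of request service rate (IO/s) to streaming bandwidth (MB/s)."""
    return io_per_s / mb_per_s


avg_access_s = 703e-6                 # average random 4 KB access time
mems_io_per_s = 1.0 / avg_access_s    # about 1422 IO/s
mems_gamma = gamma(mems_io_per_s, 76.0)
```

The result, roughly 18.7, rounds to the value of 19 quoted above.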


The remainder of this section enumerates a num- 
ber of relevant similarities and differences between 
MEMS-based storage devices and conventional disk 
drives. With each item, we also discuss consequences 
for device management issues and techniques. 


Mechanical positioning. Both disks and MEMS- 
based storage devices have two main components of 
positioning time for each request: seek and rota- 
tion for disks, X and Y dimension seeks for MEMS- 
based storage devices. The major difference is that 
the disk components are independent (i.e., desired 
sectors rotate past the read/write head periodically, 
independent of when seeks complete), whereas the 
two components are explicitly done in parallel for 
MEMS-based storage devices. As a result, total 
positioning time for MEMS-based storage equals the
greater of the X and Y seek times, making the lesser 
time irrelevant. The effect of this overlap on request 
scheduling is discussed in Section 4.2. 
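

The difference between the two positioning models can be written down
directly. The functions below are a simplified sketch with hypothetical
names: the disk case treats rotational latency as the wait that remains
after the seek completes, and the MEMS case assumes any settling time is
already included in the X seek time.

```python
def disk_positioning_time(seek_s, rotational_latency_s):
    """Disk: rotation proceeds independently of the seek, so the head waits
    for the sector after the seek completes; the two delays add."""
    return seek_s + rotational_latency_s


def mems_positioning_time(x_seek_s, y_seek_s):
    """MEMS: X and Y motion overlap, so the slower dimension dominates."""
    return max(x_seek_s, y_seek_s)


# Illustrative values only; they are not measurements from the paper.
disk_example = disk_positioning_time(5e-3, 3e-3)
mems_example = mems_positioning_time(0.5e-3, 0.3e-3)
```

In the MEMS example, shortening the 0.3 ms Y seek would not change the
total, since the 0.5 ms X seek determines the positioning time.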


Settling time. For both disks and MEMS-based 
storage devices, it is necessary for read/write heads 
to settle over the desired track after a seek. Set- 
tling time for disks is a relatively small component 
of most seek times (0.5 ms of 1-15 ms seeks). However,
settling time for MEMS-based storage devices
is expected to be a relatively substantial component
of seek time (0.2 ms of 0.2-0.8 ms seeks). Because
the settling time is generally constant, this has the 
effect of making seek times more constant, which in 
turn could reduce (but not eliminate) the benefit of 
both request scheduling and data placement. Sec- 
tion 4.3 discusses this issue. 
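

A quick calculation with the ranges quoted above shows why a constant
settling time flattens seek times. The helper names are hypothetical, and
the ranges are read as the shortest and longest seeks for each device.

```python
def settle_share(settle_ms, seek_ms):
    """Fraction of a seek that is spent settling."""
    return settle_ms / seek_ms


def seek_spread(shortest_ms, longest_ms):
    """Ratio of the longest seek to the shortest seek."""
    return longest_ms / shortest_ms


# (share for the longest seek, share for the shortest seek)
disk_share = (settle_share(0.5, 15.0), settle_share(0.5, 1.0))
mems_share = (settle_share(0.2, 0.8), settle_share(0.2, 0.2))

disk_spread = seek_spread(1.0, 15.0)
mems_spread = seek_spread(0.2, 0.8)
```

Settling accounts for roughly 3% to 50% of a disk seek but 25% to 100% of
a MEMS seek, and the longest MEMS seek is only about 4 times the
shortest, compared with 15 times for disks.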


Logical-to-physical mappings. As with disks, 
we expect the lowest-level mapping of logical block 
numbers (LBNs) to physical locations to be straightforward
and optimized for sequential access; this will
be best for legacy systems that use these new de- 
vices as disk replacements. Such a sequentially op- 
timized mapping scheme fits disk terminology and 
has some similar characteristics. Nonetheless, the 
physical differences will make data placement deci- 
sions (mapping of file or database blocks to LBNs) 
an interesting topic. Section 5 discusses this issue. 
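

One possible sequentially optimized mapping is sketched below. It is a
hypothetical layout constructed for illustration, not the mapping used by
any device or by the simulator: consecutive LBNs fill a row of 20
concurrently accessible blocks, then successive rows along Y, then
successive tracks, then successive cylinders. Reversing the Y direction
on odd-numbered tracks is an additional assumption, motivated by
bidirectional access and turnarounds.

```python
CYLINDERS = 2500
TRACKS_PER_CYLINDER = 5
ROWS_PER_TRACK = 27        # sector rows along Y (assumed floor(2500 / 90))
BLOCKS_PER_ROW = 20        # 1280 active tips / 64 tips per 512-byte block
BLOCKS_PER_TRACK = ROWS_PER_TRACK * BLOCKS_PER_ROW
BLOCKS_PER_CYLINDER = TRACKS_PER_CYLINDER * BLOCKS_PER_TRACK


def lbn_to_physical(lbn):
    """Map an LBN to (cylinder, track, row, slot) in a hypothetical layout.

    Consecutive LBNs fill one row of concurrently accessible blocks, then
    the next row along Y, then the next track, then the next cylinder.
    Odd-numbered tracks are traversed in the opposite Y direction so that a
    track switch needs only a turnaround (an assumption of this sketch).
    """
    if not 0 <= lbn < CYLINDERS * BLOCKS_PER_CYLINDER:
        raise ValueError("LBN out of range")
    cylinder, rest = divmod(lbn, BLOCKS_PER_CYLINDER)
    track, rest = divmod(rest, BLOCKS_PER_TRACK)
    step, slot = divmod(rest, BLOCKS_PER_ROW)
    row = step if track % 2 == 0 else ROWS_PER_TRACK - 1 - step
    return cylinder, track, row, slot
```

For example, LBNs 0 through 19 share row 0 of track 0 and can be read in
a single sector time, while LBN 540 begins track 1 at the far end of the
sled's Y travel.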


Seek time vs. seek distance. For disks, seek 
times are relatively constant functions of the seek 
distance, independent of the start cylinder and direction
of seek. Because of the spring restoring
forces, this is not true of MEMS-based storage devices.
Short seeks near the edges take longer than
they do near the center (as discussed in Section 5). 
Also, turnarounds near the edges take either less 
time or more, depending on the direction of sled mo- 
tion. As a result, seek-reducing request scheduling 
algorithms [34] may not achieve their best perfor- 
mance if they look only at distances between LBNs 
as they can with disks. 


Recording density. Some MEMS-based storage 
devices use the same basic magnetic recording tech- 
nologies as disks [1]. Thus, the same types of fab- 
rication and grown media defects can be expected. 
However, because of the much higher bit densities of 
MEMS-based storage devices, each such media de- 
fect will affect a much larger number of bits. This 
is one of the fault management issues discussed in 
Section 6.1. 


Numbers of mechanical components. MEMS- 
based storage devices have many more distinct me- 
chanical parts than disks. Although their very small 
movements make them more robust than the large 
disk mechanics, their sheer number makes it much 
more likely that some number of them will break. In 
fact, manufacturing yields may dictate that the de- 
vices operate with some number of broken mechan- 
ical components. Section 6.1 discusses this issue. 


Concurrent read/write heads. Because it is dif- 
ficult and expensive for drive manufacturers to en- 
able parallel activity, most modern disk drives use 
only one read/write head at a time for data ac- 
cess. Even drives that do support parallel activity 
are limited to only 2-20 heads. On the other hand, 
MEMS-based storage devices (with their per-tip ac- 
tuation and control components) could theoretically 
use all of their probe tips concurrently. Even after 
power and heat considerations, hundreds or thou- 
sands of concurrently active probe tips is a realistic 
expectation. This parallelism increases media band- 
width and offers opportunities for improved reliabil- 
ity. Section 6.1 discusses the latter. 


Control over mechanical movements. Unlike 
disks, which rotate at constant velocity independent 
of ongoing accesses, the mechanical movements of 
MEMS-based storage devices can be explicitly con- 
trolled. As a result, access patterns that suffer sig- 
nificantly from independent rotation can be better 
served. The best example of this is repeated access 
to the same block, as often occurs for synchronous 
metadata updates or read-modify-write sequences. 
This difference is discussed in Section 6.2. 


Startup activities. Like disks, MEMS-based stor-
age devices will require some time to ready them-
selves for media accesses when powered up. How-
ever, because of the size of their mechanical struc-
tures and their lack of rotation, the time and power
required for startup will be much less than for disks. 
The consequences of this fact for both availability 
(Section 6.3) and power management (Section 7) are 
discussed in this paper. 


Drive-side management. As with disks, manage- 
ment functionality will be split between host OSes 
and device OSes (firmware). Over the years, increas- 
ing amounts of functionality have shifted into disk 
firmware, enabling a variety of portability, reliabil- 
ity, mobility, performance, and scalability enhance- 
ments. We expect a similar trend with MEMS-based 
storage devices, whose silicon implementations offer 
the possibility of direct integration of storage with 
computational logic. 


Speed-matching buffers. As with disks, MEMS- 
based storage devices access the media as the sled 
moves past the probe tips at a fixed rate. Since this 
rate rarely matches that of the external interface, 
speed-matching buffers are important. Further, be- 
cause sequential request streams are important as- 
pects of many real systems, these speed-matching 
buffers will play an important role in prefetching 
and then caching of sequential LBNs. Also, as with 
disks, most block reuse will be captured by larger 
host memory caches instead of in the device cache. 


Sectors per track. Disk media is organized as a 
series of concentric circles, with outer circles hav- 
ing larger circumferences than inner circles. This 
fact led disk manufacturers to use banded (zoned) 
recording in place of a constant bit-per-track scheme 
in order to increase storage density and bandwidth. 
For example, banded recording results in a 3:2 ratio 
between the number of sectors on the outermost (334 
sectors) and innermost (229 sectors) tracks in the 
Quantum Atlas 10K [8]. Because MEMS-based stor- 
age devices instead organize their media in fixed-size 
columns, there is no length difference between tracks 
and banded recording is not relevant. Therefore, 
block layout techniques that try to exploit banded 
recording will not provide benefit for these devices. 
On the other hand, for block layouts that try to 
consider track boundaries and block offsets within 
tracks, this uniformity (which was common in disks 
10 or more years ago) will simplify or enable correct 
implementations. The subregioned layout described 
in Section 5 is an example of such a layout. 





device capacity
number of tips
maximum concurrent tips
sled acceleration              803.6 m/s²
sled access speed              28 mm/s
constant settling time         0.22 ms
spring factor                  75%
per-tip data rate              0.7 Mbit/s
media bit cell size            40 × 40 nm
bits per tip region (M × N)    2500 × 2440
data encoding overhead         2 bits per byte
servo overhead per 8 bytes     10 bits (11%)
command processing overhead    0.2 ms/request
on-board cache memory
external bus bandwidth


Table 1: Default MEMS-based storage device pa- 
rameters. N=2440 in order to fit an integral number of 
80-bit encoded sectors (with inter-sector servo) in each 
column of bits. The default model includes no on-board 
caching (or prefetching), but does assume speed-matching 
buffers between the tips and the external bus. 
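
The choice of N=2440 can be checked with a little arithmetic. The sketch below is a consistency check rather than the model's own formula: it assumes that each tip sector holds 8 data bytes encoded at 10 bits per byte (80 bits), and that a 10-bit servo field precedes every sector and follows the last one. The function name and this exact field arrangement are assumptions made for illustration.

```python
from typing import Optional

# Values taken from Table 1.
DATA_BYTES_PER_TIP_SECTOR = 8
ENCODED_BITS_PER_BYTE = 8 + 2   # 2 bits of encoding overhead per byte
SERVO_BITS = 10                 # servo overhead per 8 data bytes
COLUMN_BITS = 2440              # N in Table 1

SECTOR_BITS = DATA_BYTES_PER_TIP_SECTOR * ENCODED_BITS_PER_BYTE  # 80 bits


def sectors_per_column(column_bits: int) -> Optional[int]:
    """Return n if n sectors plus n + 1 servo fields fill the column exactly.

    Assumption: servo fields sit between sectors and at both column ends.
    """
    n, remainder = divmod(column_bits - SERVO_BITS, SECTOR_BITS + SERVO_BITS)
    return n if remainder == 0 else None


print(sectors_per_column(COLUMN_BITS))
```

Under this assumption the column holds a whole number of sectors, and one 10-bit servo field per 90-bit sector-plus-servo unit is consistent with the 11% servo overhead listed in Table 1.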


3 Experimental setup 


The experiments in this paper use the performance 
model for MEMS-based storage described in Refer-
ence [11], which includes all of the characteristics de-
scribed above. Although it is not yet possible to val-
idate the model against real devices, both the equa-
tions and the default parameters are the result of
extensive discussions with groups that are designing 
and building MEMS-based storage devices [2, 3, 20]. 
We therefore believe that the model is sufficiently 
representative for the insights gained from experi- 
ments to be useful. Table 1 shows default parame- 
ters for the MEMS-based storage device simulator. 


This performance model has been integrated into 
the DiskSim simulation environment [10] as a disk- 
like storage device accessed via a SCSI-like protocol. 
DiskSim provides an infrastructure for exercising the 
device model with various synthetic and trace-based 
workloads. DiskSim also includes a detailed, vali- 
dated disk module that can accurately model a va- 
riety of real disks. For reference, some experiments 
use DiskSim’s disk module configured to emulate the 
Quantum Atlas 10K, one of the disks for which pub- 
licly available configuration parameters have been 
calibrated against real-world drives [8]. The Quan- 
tum Atlas 10K has a nominal rotation speed of 
10,000 RPM, average seek time of 5.0 ms, streaming 
bandwidth of 17.3-25.2 MB/s, and average random 
single-sector access time of 8.5 ms [22]. 







Some of the experiments use a synthetically-
generated workload that we refer to as the Random
workload. For this workload, request inter-arrival 
times are drawn from an exponential distribution; 
the mean is varied to simulate a range of workloads. 
All other aspects of requests are independent: 67% 
are reads, 33% are writes, the request size distribu- 
tion is exponential with a mean of 4 KB, and request 
starting locations are uniformly distributed across 
the device’s capacity. 
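
The Random workload can be reproduced with a few lines of Python. The sketch below is illustrative only: the `Request` record, the `random_workload` function, the 512-byte sector size, and the rounding of sizes up to whole sectors are assumptions, not details of the DiskSim workload generator.

```python
import math
import random
from dataclasses import dataclass

SECTOR_BYTES = 512           # assumed sector size
MEAN_SIZE_BYTES = 4 * 1024   # mean request size from the text
READ_FRACTION = 0.67         # 67% reads, 33% writes


@dataclass
class Request:
    arrival_ms: float
    is_read: bool
    sectors: int
    start_lbn: int


def random_workload(count, mean_interarrival_ms, capacity_lbns, seed=0):
    """Return `count` requests drawn as described for the Random workload."""
    rng = random.Random(seed)
    now_ms = 0.0
    requests = []
    for _ in range(count):
        # Exponential inter-arrival times; the mean sets the load level.
        now_ms += rng.expovariate(1.0 / mean_interarrival_ms)
        size_bytes = rng.expovariate(1.0 / MEAN_SIZE_BYTES)
        # Assumption: sizes are rounded up to at least one whole sector.
        sectors = max(1, math.ceil(size_bytes / SECTOR_BYTES))
        requests.append(Request(
            arrival_ms=now_ms,
            is_read=rng.random() < READ_FRACTION,
            sectors=sectors,
            start_lbn=rng.randrange(capacity_lbns),  # uniform over the device
        ))
    return requests
```

Varying `mean_interarrival_ms` while holding the other parameters fixed gives the range of arrival rates used in the scheduling experiments.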


For more realistic workloads, we use two traces of 
real disk activity: the TPC-C trace and the Cello 
trace. The TPC-C trace comes from a TPC-C 
testbed, consisting of Microsoft SQL Server atop 
Windows NT. The hardware was a 300 MHz Intel 
Pentium II-based system with 128 MB of memory 
and a 1 GB test database striped across two Quan- 
tum Viking disk drives. The trace captures one hour 
of disk activity for TPC-C, and its characteristics are 
described in more detail in Reference [23]. The Cello 
trace comes from a Hewlett-Packard system running 
the HP-UX operating system. It captures disk ac- 
tivity from a server at HP Labs used for program 
development, simulation, mail, and news. While the 
total trace is actually two months in length, we re- 
port data for a single, day-long snapshot. This trace 
and its characteristics are described in detail in Ref- 
erence [25]. When replaying the traces, each traced 
disk is replaced by a distinct simulated MEMS-based 
storage device. 


As is often the case in trace-based studies, our simu- 
lated devices are newer and significantly faster than 
the disks used in the traced systems. To explore 
a range of workload intensities, we replicate an ap- 
proach used in previous disk scheduling work [34]: 
we scale the traced inter-arrival times to produce a 
range of average inter-arrival times. When the scale 
factor is one, the request inter-arrival times match 
those of the trace. When the scale factor is two, the 
traced inter-arrival times are halved, doubling the 
average arrival rate. 
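
The trace scaling step can be sketched as follows. The function name and the choice to leave the first arrival time unchanged are assumptions made for illustration; only the rule "divide each inter-arrival gap by the scale factor" comes from the text.

```python
def scale_arrivals(arrival_times_ms, scale_factor):
    """Divide every inter-arrival gap by `scale_factor`.

    A scale factor of one reproduces the trace; a factor of two halves the
    gaps and therefore doubles the average arrival rate.
    """
    if scale_factor <= 0:
        raise ValueError("scale_factor must be positive")
    if not arrival_times_ms:
        return []
    scaled = [arrival_times_ms[0]]  # assumption: first arrival is unchanged
    for previous, current in zip(arrival_times_ms, arrival_times_ms[1:]):
        scaled.append(scaled[-1] + (current - previous) / scale_factor)
    return scaled
```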


4 Request scheduling 


An important mechanism for improving disk effi- 
ciency is deliberate scheduling of pending requests. 
Request scheduling improves efficiency because po- 
sitioning delays are dependent on the relative po- 
sitions of the read/write head and the destination 
sector. The same is true of MEMS-based storage 
devices, whose seek times are dependent on the dis- 
tance to be traveled. This section explores the im- 
pact of different scheduling algorithms on the per- 
formance of MEMS-based storage devices. 


4.1 Disk scheduling algorithms


Many disk scheduling algorithms have been devised 
and studied over the years. Our comparisons focus 
on four. First, the simple FCFS (first-come, first- 
served) algorithm often results in suboptimal perfor- 
mance, but we include it for reference. The SSTF 
(shortest seek time first) algorithm was designed to 
select the request that will incur the smallest seek 
delay [4], but this is rarely the way it functions in 
practice. Instead, since few host OSes have the in- 
formation needed to compute actual seek distances 
or predict seek times, most SSTF implementations 
use the difference between the last accessed LBN and 
the desired LBN as an approximation of seek time. 
This simplification works well for disk drives [34], 
and we label this algorithm as SSTF_LBN. The 
CLOOK_LBN (cyclical look) algorithm services re- 
quests in ascending LBN order, starting over with 
the lowest LBN when all requests are “behind” the 
most recent request [28]. The SPTF (shortest po- 
sitioning time first) policy selects the request that 
will incur the smallest positioning delay [14, 29]. For
disks, this algorithm differs from others in that it 
explicitly considers both seek time and rotational 
latency. 
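
The three LBN-based policies reduce to a few lines each. In this sketch the pending queue is a list of starting LBNs in arrival order and `last_lbn` is the most recently serviced LBN; both representations and the function names are illustrative assumptions. SPTF is omitted here because it needs a device model to predict positioning time.

```python
def fcfs(queue, last_lbn):
    """First-come, first-served: take the oldest pending request."""
    return queue[0]


def sstf_lbn(queue, last_lbn):
    """Approximate shortest seek time by the smallest LBN difference."""
    return min(queue, key=lambda lbn: abs(lbn - last_lbn))


def clook_lbn(queue, last_lbn):
    """Service requests in ascending LBN order, wrapping to the lowest LBN
    once every pending request is behind the most recent one."""
    ahead = [lbn for lbn in queue if lbn >= last_lbn]
    return min(ahead) if ahead else min(queue)


pending = [900, 120, 510, 95]  # hypothetical queue, oldest request first
for policy in (fcfs, sstf_lbn, clook_lbn):
    print(policy.__name__, policy(pending, last_lbn=100))
```

With the head at LBN 100, SSTF_LBN picks the nearby request behind the head while CLOOK_LBN keeps moving forward, which is the behavior that gives CLOOK_LBN its starvation resistance.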


For reference, Figure 5 compares these four disk 
scheduling algorithms for the Atlas 10K disk drive 
and the Random workload (Section 3) with a range 
of request arrival rates. Two common metrics for 
evaluating disk scheduling algorithms are shown. 
First, the average response time (queue time plus 
service time) shows the effect on average perfor- 
mance. As expected, FCFS saturates well be- 
fore the other algorithms as the workload increases. 
SSTF_LBN outperforms CLOOK_LBN, and SPTF 
outperforms all other schemes. Second, the squared
coefficient of variation (σ²/μ²) is a metric of “fair-
ness” (or starvation resistance) [30, 34]; lower val-
ues indicate better starvation resistance. As ex-
pected, CLOOK_LBN avoids the starvation effects
that characterize the SSTF_LBN and SPTF algo- 
rithms. Although not shown here, age-weighted 
versions of these greedy algorithms can reduce re- 
quest starvation without unduly reducing average 
case performance [14, 29]. 
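
The fairness metric is straightforward to compute from a list of measured response times: it is the variance of the response time divided by the square of its mean. The sketch below uses the population variance; whether the simulator uses the population or the sample form is not stated in the text, so that choice is an assumption.

```python
from statistics import mean, pvariance


def squared_cov(response_times_ms):
    """Squared coefficient of variation: variance / mean**2."""
    mu = mean(response_times_ms)
    return pvariance(response_times_ms, mu) / (mu * mu)


# Hypothetical samples: identical response times versus one starved request.
print(squared_cov([2.0, 2.0, 2.0, 2.0]))
print(squared_cov([1.0, 1.0, 1.0, 9.0]))
```

A schedule that treats all requests alike yields a value near zero, while a single long-delayed request raises the metric sharply.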


4.2 MEMS-based storage scheduling 


Figure 5: Comparison of scheduling algorithms for the Random workload on the Quantum Atlas 10K
disk.


Existing disk scheduling algorithms can be adapted
to MEMS-based storage devices once these devices
are mapped onto a disk-like interface. Most al-
gorithms, including SSTF_LBN and CLOOK_LBN,
only use knowledge of LBNs and assume that dif-
ferences between LBNs are reasonable approxima-
tions of positioning times. SPTF, which addresses
disk seeks and rotations, is a more interesting case.
While MEMS-based storage devices do not have a
rotational latency component, they do have two po-
sitioning time components: the X dimension seek
and the Y dimension seek. As with disks, only one
of these two components (seek time for disks; the X
dimension seek for MEMS-based storage devices) is
approximated well by a linear LBN space. Unlike
disks, the two positioning components proceed in
parallel, with the greater subsuming the lesser. The
settling time delay makes most X dimension seek
times larger than most Y dimension seek times. Al-
though it should never be worse, SPTF will only be
better than SSTF (which minimizes X movements,
but ignores Y) when the Y component is frequently
the larger.
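
The interaction between parallel X and Y positioning and the settling time can be illustrated with a toy example. The sketch below is not the simulator's model: the function names, the two pending requests, and their seek times are hypothetical, and settling is assumed to be charged only when the sled moves in X.

```python
DEFAULT_SETTLE_MS = 0.22  # constant settling time from Table 1


def positioning_time_ms(x_seek_ms, y_seek_ms, settle_ms=DEFAULT_SETTLE_MS):
    """X and Y seeks overlap, so the slower of the two sets the delay."""
    # Assumption: settling applies only when there is X movement.
    x_total = x_seek_ms + settle_ms if x_seek_ms > 0 else 0.0
    return max(x_total, y_seek_ms)


def pick_sstf(pending):
    """Greedy on X movement only, ignoring Y."""
    return min(pending, key=lambda name: pending[name][0])


def pick_sptf(pending, settle_ms=DEFAULT_SETTLE_MS):
    """Greedy on the full positioning time."""
    return min(
        pending,
        key=lambda name: positioning_time_ms(*pending[name], settle_ms),
    )


# Hypothetical (x_seek_ms, y_seek_ms) pairs for two pending requests.
pending = {"A": (0.10, 0.45), "B": (0.15, 0.05)}
for settle in (0.0, 0.22, 0.44):
    print(settle, pick_sstf(pending), pick_sptf(pending, settle))
```

In this example the two policies disagree while the Y seek of request A is its larger component, and they agree once the settling time is large enough for the X component to dominate both requests.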


Figure 6 shows how well these algorithms work
for the default MEMS-based storage device on the
Random workload with a range of request arrival
rates. In terms of both performance and starva-
tion resistance, the algorithms finish in the same
order as for disks: SPTF provides the best per-
formance and CLOOK_LBN provides the best star-
vation resistance. However, their performance rel-
ative to each other merits discussion. The differ-
ence between FCFS and the LBN-based algorithms
(CLOOK_LBN and SSTF_LBN) is larger for MEMS-
based storage devices because the seek time is a
much larger component of the total service time. In
particular, there is no subsequent rotational delay.
Also, the average response time difference between
CLOOK_LBN and SSTF_LBN is smaller for MEMS-
based storage devices, because both algorithms re-
duce the X seek times into the range where X and Y
seek times are comparable. Since neither addresses
Y seeks, the greediness of SSTF_LBN is less effective.
SPTF obtains additional performance by addressing
Y seeks.


Figures 7(a) and 7(b) show how the scheduling al- 
gorithms perform for the Cello and TPC-C work- 
loads, respectively. The relative performance of the 
algorithms on the Cello trace is similar to the Ran- 
dom workload. The overall average response time 
for Cello is dominated by the busiest one of Cello’s
eight disks; some of the individual disks have differ-
ently shaped curves but still exhibit the same order-
ing among the algorithms. One noteworthy differ-
ence between TPC-C and Cello is that SPTF out-
performs the other algorithms by a much larger mar-
gin for TPC-C at high loads. This occurs be-
cause the scaled-up version of the workload includes
many concurrently-pending requests with very small 
LBN distances between adjacent requests. LBN- 
based schemes do not have enough information to 
choose between such requests, often causing small 
(but expensive) X-dimension seeks. SPTF addresses 
this problem and therefore performs much better. 


4.3 SPTF and settling time 


Originally, we had expected SPTF to outperform the
other algorithms by a greater margin for MEMS-
based storage devices. Our investigations suggest
that the value of SPTF scheduling is highly depen-
dent upon the settling time component of X dimen-
sion seeks. With large settling times, X dimension
seek times dominate Y dimension seek times, mak-
ing SSTF_LBN match SPTF. With small settling
times, Y dimension seek times are a more signifi-
cant component. To illustrate this, Figure 8 com-
pares the scheduling algorithms with the constant
settling time set to zero and 0.44 ms (double the de-
fault value). As expected, SSTF_LBN is very close
to SPTF when the settling time is doubled. With
zero settling time, SPTF outperforms the other al-
gorithms by a large margin.


Figure 6: Comparison of scheduling algorithms for the Random workload on the MEMS-based storage
device. Note that the scale of the X axis has increased by an order of magnitude relative to the graphs in Figure 5.

Figure 7: Comparison of scheduling algorithms for the Cello and TPC-C workloads on the MEMS-based
storage device.

Figure 8: Comparison of average performance of the Random workload for zero and double constant
settling time on the MEMS-based storage device. These are in comparison to the default model (Random
with constant settling time of 0.22 ms) shown in Figure 6(a). With no settling time, SPTF significantly outperforms
CLOOK_LBN and SSTF_LBN. With the doubled settling time, CLOOK_LBN, SSTF_LBN, and SPTF are nearly
identical.

Figure 9: Difference in request service time for subregion accesses. This figure divides the region accessible by
an individual probe tip into 25 subregions, each 500×500 bits. Each bar shows the average request service time (in
milliseconds) for random requests starting and ending inside that subregion. The upper numbers represent the service
time when the default settling time is included in calculations; numbers in italics represent the service time for zero
settling time. Note that the service time differs by 14-21% between the centermost and outermost subregions.


5 On-device data layout 


Space allocation and data placement for disks con- 
tinues to be a ripe topic of research. We expect the 
same to be true of MEMS-based storage devices. In 
this section, we discuss how the characteristics of 
MEMS-based storage positioning costs affect place- 
ment decisions for small local accesses and large se- 
quential transfers. A bipartite layout is proposed 
and shown to have potential for improving perfor- 
mance. 


5.1 Small, skewed accesses 


As with disks, short distance seeks are faster than 
long distance seeks. Unlike disks, MEMS-based stor- 
age devices’ spring restoring forces make the effective 
actuator force (and therefore sled positioning time) 
a function of location. Figure 9 shows the impact of 
spring forces for seeks inside different “subregions” 
of a single tip’s media region. The spring forces in- 
crease with increasing sled displacement from the 
origin (viz., toward the outermost subregions in Fig- 
ure 9), resulting in longer positioning times for short 
seeks. As a result, distance is not the only compo- 
nent to be considered when finding good placements 
for small, popular data items—offset relative to the 
center should also be considered. 


5.2 Large, sequential transfers 


Streaming media transfer rates for MEMS-based 
storage devices and disks are similar: 17.3- 
25.2MB/s for the Atlas 10K [22]; 75.9 MB/s for 
MEMS-based storage devices. Positioning times, 
however, are an order of magnitude shorter for 
MEMS-based storage devices than for disks. This 
makes positioning time relatively insignificant for 
large transfers (e.g., hundreds of sectors). Figure 10 
shows the request service times for a 256KB read 
with respect to the X distance between the initial 
and final sled positions. Requests traveling 1250 
cylinders (e.g., from the sled origin to maximum sled 
displacement) incur only a 10% penalty. This lessens 
the importance of ensuring locality for data that will 
be accessed in large, sequential chunks. In contrast, 
seek distance is a significant issue with disks, where 
long seeks more than double the total service time 
for 256 KB requests. 


5.3 A data placement scheme for
MEMS-based storage devices


To take advantage of the above characteristics, 
we propose a 25-subregion bipartite layout scheme. 
Small data are placed in the centermost subregions; 
long, sequential streaming data are placed in outer 
subregions. Two layouts are tested: a five-by-five 
grid of subregions (Figure 9) and a simple columnar 
division of the LBN space into 25 columns (viz., col- 
umn 0 contains cylinders 0-99, column 1 contains 
cylinders 100-199, etc.). 


Figure 10: Large (256KB) request service time
vs. X seek distance for MEMS-based storage de-
vices. Because the media access time is large relative
to the positioning time, seeking the maximum distance
in X increases the service time for large requests by only
12%.


We compare these layout schemes against the “or- 
gan pipe” layout [26, 32], an optimal disk-layout 
scheme, assuming no inter-request dependencies. In 
the organ pipe layout, the most frequently accessed 
files are placed in the centermost tracks of the disk. 
Files of decreasing popularity are distributed to ei- 
ther side of center, with the least frequently accessed 
files located closer to the innermost and outermost 
tracks. Although this scheme is optimal for disks, 
files must be periodically shuffled to maintain the 
frequency distribution. Further, the layout requires 
some state to be kept, indicating each file’s popular- 
ity. 

To evaluate these layouts, we used a workload of 
10,000 whole-file read requests whose sizes are drawn 
from the file size distribution reported in Refer- 
ence [9]. In this size distribution, 78% of files are 
8KB or smaller, 4% are larger than 64KB, and 
0.25% are larger than 1 MB. For the subregioned and 
columnar layouts, the large files (larger than 8 KB) 
were mapped to the ten leftmost and ten rightmost 
subregions, while the small files (8 KB or less) were 
mapped to the centermost subregion. To conserva- 
tively avoid second-order locality within the large or 
small files, we assigned a random location to each re- 
quest within either the large or the small subregions. 
For the organ pipe layout, we used an exponential 
distribution to determine file popularity, which was 
then used to place files. 
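
The columnar variant of the bipartite placement can be sketched as a size-based mapping from files to cylinders. This is a simplified illustration rather than the layout code used in the experiments: the function and constant names are hypothetical, and the sketch assumes that "centermost" means column 12 of the 25 columns, each spanning 100 cylinders.

```python
import random

NUM_COLUMNS = 25
CYLINDERS_PER_COLUMN = 100          # column 0 holds cylinders 0-99, etc.
SMALL_FILE_LIMIT_BYTES = 8 * 1024   # files of 8 KB or less count as small

# Ten leftmost and ten rightmost columns hold large files.
LARGE_COLUMNS = list(range(0, 10)) + list(range(15, 25))
# Assumption: the centermost column holds all small files.
SMALL_COLUMNS = [NUM_COLUMNS // 2]


def place_cylinder(size_bytes, rng):
    """Pick a random cylinder inside the region assigned to this file size."""
    if size_bytes <= SMALL_FILE_LIMIT_BYTES:
        columns = SMALL_COLUMNS
    else:
        columns = LARGE_COLUMNS
    column = rng.choice(columns)
    # Random offset within the column avoids second-order locality.
    return column * CYLINDERS_PER_COLUMN + rng.randrange(CYLINDERS_PER_COLUMN)


rng = random.Random(0)
print(place_cylinder(4 * 1024, rng), place_cylinder(256 * 1024, rng))
```

Unlike organ pipe, this mapping needs only the file size, so no popularity state has to be kept and no files have to be reshuffled over time.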


Figure 11 shows that all three layout schemes achieve
a 12-15% improvement in average access time over a
simple random file layout. Subregioned and colum-
nar layouts for MEMS-based storage devices match
organ pipe, even with the conservative model and
no need for keeping popularity data or periodically
reshuffling files on the media. For the “no settling
time” case, the subregioned layout provides the best
performance as it addresses both X and Y.


Figure 11: Comparison of layout schemes for
MEMS-based storage devices. For the default de-
vice, the organ pipe, subregioned, and columnar layouts
achieve a 12-15% performance improvement over a ran-
dom layout. Further, for the “settling time = 0” case,
the subregioned layout outperforms the others by an ad-
ditional 12%. It is interesting to note that an optimal
disk layout technique does not necessarily provide the best
performance for MEMS-based storage.


6 Failure management 


Fault tolerance and recoverability are significant 
considerations for storage systems. Although 
MEMS-based storage devices are not yet available, 
MEMS components have been built and tested for 
many years. Their miniature size and movements 
will make MEMS-based storage components less 
fragile than their disk counterparts [17]. Still, there 
will likely be more defective or failed parts in MEMS- 
based storage because of the large number of distinct 
components compared to disks and the fact that bad 
parts cannot be replaced before or during assembly. 


Although failure management for MEMS-based stor- 
age devices will be similar to failure management 
for conventional disks, there are several important 
differences. One is that individual component fail- 
ures must be made less likely to render a device 
inoperable than in disks. Another is that MEMS- 
based storage devices simplify some aspects of fail- 
ure management—inter-device redundancy mainte- 
nance and device restart, for example. This section 
discusses three aspects of failure management: in- 
ternal faults, device failures, and recoverability from 
system crashes. 


6.1 Internal faults


The common failure modes for disk drives include re- 
coverable failures (for example, media defects or seek 
errors) and non-recoverable failures (head crashes, 
motor or arm actuator failure, drive electronics or 
channel failure). MEMS-based storage devices have 
similar failure modes with analogous causes. How- 
ever, the ability to incorporate multiple tips into fail- 
ure tolerance schemes allows MEMS-based storage 
devices to mask most component failures, including 
many that would render a disk inoperable. 


Specifically, powerful error correcting codes can be 
computed over data striped across multiple tips. In 
our default model, each 512 byte logical block and 
its ECC are striped across 64 tips. This ECC can 
include both a horizontal component (across tips) 
and a vertical component (within a single sector). 
The horizontal ECC can recover from missing sec- 
tors. The vertical ECC identifies sectors that should 
be treated as missing—with the effect of converting 
some large errors into erasures, which can more eas- 
ily be handled by the horizontal ECC. This single 
mechanism addresses most internal failures that are 
recoverable. 
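
The idea of recovering a missing tip sector from data striped across tips can be illustrated with the simplest possible erasure code. The sketch below is a stand-in, not the ECC of the device model: it uses a single XOR parity sector (which tolerates exactly one erasure), and it places that parity on a hypothetical extra tip in addition to the 8-byte pieces of a 512-byte block. A real design would use a more powerful code.

```python
from functools import reduce

TIP_SECTOR_BYTES = 8  # 512-byte block / 64 tips


def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def stripe(block):
    """Split a block into per-tip pieces and append one XOR parity piece."""
    pieces = [
        block[i:i + TIP_SECTOR_BYTES]
        for i in range(0, len(block), TIP_SECTOR_BYTES)
    ]
    return pieces + [reduce(xor_bytes, pieces)]


def recover(pieces):
    """Rebuild the single piece marked None (an erasure) from the others."""
    missing = [i for i, piece in enumerate(pieces) if piece is None]
    if len(missing) != 1:
        raise ValueError("XOR parity recovers exactly one erasure")
    present = [piece for piece in pieces if piece is not None]
    repaired = list(pieces)
    repaired[missing[0]] = reduce(xor_bytes, present)
    return repaired


block = bytes(range(256)) * 2       # a 512-byte logical block
pieces = stripe(block)
damaged = list(pieces)
damaged[17] = None                  # tip 17's sector flagged as missing
print(b"".join(recover(damaged)[:-1]) == block)
```

The vertical ECC plays the role of deciding which piece to mark as `None`; once a bad sector is flagged, the horizontal code only has to fill in a known gap.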


Media defects. In disk drives, unrecoverable me- 
dia defects are handled by remapping logical block 
numbers to non-defective locations, with data of- 
ten being lost when defects “grow” during opera- 
tion. In MEMS-based storage, most media defects 
are expected to affect the data under a small num- 
ber of tips (e.g., 1-4). Therefore, the horizontal ECC 
can usually be used to reconstruct unavailable bits. 
This capability is particularly important because 
the higher density of MEMS-based storage causes 
a given defect to affect more bits than it would in 
a disk. Tolerance of large media defects can be fur- 
ther extended by spreading each logical block’s data 
and ECC among physically distant tips—graph col- 
oring schemes excel at the types of data mappings 
required. 


Tip failures. Failure of a conventional disk’s 
read/write head or control logic generally renders 
the entire device inoperable. MEMS-based stor- 
age replicates these functions across thousands of 
components. With so many components, failure of 
one or more is not only possible, but probable— 
individual probe tips can break off or “crash” into 
the media, and fabrication variances will produce 
faulty tips or faulty tip-specific logic. Most such 
problems can be handled using the same mechanisms 
that handle media failures, since failure of a tip or
its associated control logic translates into unavail-
ability of data in the corresponding tip region. The
horizontal ECC can be used to reconstruct this data.


As with disk drives, spare space needs to be with-
held from the fault-free mapping of data to physical
locations in MEMS-based storage. This spare space is
used to store data that cannot be stored at its default 
physical location because of media or tip failures. 
The parallel operation of tips within a track pro- 
vides an opportunity to avoid the performance and 
predictability penalties normally associated with de- 
fect remapping in disk drives. Specifically, by setting 
aside one or more spare tips in each track, unread- 
able sectors can be remapped to the same sector 
under a spare tip. A sector remapped in this way 
would be accessed at exactly the same time as the 
original (unavailable) sector would have been. In 
contrast, disks “slip” LBNs over defective sectors or 
re-map them to spare sectors elsewhere in a cylinder 
or zone, changing their access times relative to their 
original locations. 


6.2 Device failures 


MEMS-based storage devices are susceptible to sim- 
ilar non-recoverable failures as disk drives: strong 
external mechanical or electrostatic forces can dam- 
age the actuator comb fingers or snap off the springs, 
manufacturing defects can surface, or the device 
electronics or channel can fail. These failures should 
appear and be handled in the same manner as for 
disks. For example, appropriate mechanisms for 
dealing with device failures include inter-device re- 
dundancy and periodic backups. 


Interestingly, MEMS-based storage’s mechanical 
characteristics are a better match than those of 
disks for the common read-modify-write operations 
used in some fault-tolerant schemes (e.g., RAID- 
5). Whereas conventional disks suffer a full rotation 
to return to the same sector, MEMS-based storage 
devices can quickly reverse direction, significantly 
reducing the read-modify-write latency (Table 2). 
For the Random workload, a five-disk RAID-5 sys- 
tem has 77% longer response times than a four-disk 
striping-only system (14.3ms vs. 8.04ms); the la- 
tency increase for MEMS-based storage devices is 
only 27% (1.36 ms vs. 1.07 ms). 
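
The read-modify-write comparison comes down to simple arithmetic. The sketch below restates it under two simplifying assumptions: a disk must wait out the remainder of one rotation between the read and the write, and a MEMS-based device needs only one turnaround. The function names are illustrative, the 6 ms rotation follows from the Atlas 10K's 10,000 RPM, and the transfer and turnaround times are those listed in Table 2.

```python
ROTATION_MS = 60_000 / 10_000  # one rotation at 10,000 RPM


def disk_rmw_ms(transfer_ms, rotation_ms=ROTATION_MS):
    """Read, wait for the rest of the rotation, then write the same sectors."""
    return transfer_ms + (rotation_ms - transfer_ms) + transfer_ms


def mems_rmw_ms(transfer_ms, turnaround_ms):
    """Read, reverse the sled direction, then write the same sectors."""
    return transfer_ms + turnaround_ms + transfer_ms


def relative_increase(raid5_ms, striping_ms):
    """Fractional response-time increase of RAID-5 over striping only."""
    return raid5_ms / striping_ms - 1.0


print(disk_rmw_ms(6.00))                  # track-length transfer on the disk
print(mems_rmw_ms(2.19, 0.07))            # 334-sector transfer on MEMS
print(relative_increase(14.3, 8.04))      # disk array, Random workload
print(relative_increase(1.36, 1.07))      # MEMS array, Random workload
```

The rounded response times quoted in the text give increases of roughly 78% and 27%; the small difference from the quoted 77% presumably reflects rounding of the published times.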


6.3 Recovery from host system crashes


File systems and databases must maintain inter- 
nal consistency among persistent objects stored on 
MEMS-based storage devices, just as they do for ob- 
jects on disks. Although synchronous writes will still 
hurt performance, the low service times of MEMS- 
based storage devices will lessen the penalty. 


                 Atlas 10K          MEMS
# sectors        8       334        8       334
read                     6.00       0.13    2.19
reposition               0.00       0.07    0.07
write                    6.00
total (ms)                                  4.45

Table 2: A comparison of read-modify-write times
for 4 KB (8 sector) and disk track-length (334 sec-
tor) transfers. Conventional disks must wait for a
complete platter rotation during read-modify-write oper-
ations, whereas MEMS-based storage devices need only
perform a turnaround, a relatively inexpensive opera-
tion. This characteristic is particularly helpful for code-
based redundancy schemes (for example, RAID-5) or for
verify-after-write operations.


Another relevant characteristic of MEMS-based
storage devices is rapid device startup. Since no
spindle spin-up time is required, startup is almost
immediate, estimated at less than 0.5 ms. In con-
trast, high-end disk drives can take 15-25 seconds
before spin-up and initialization is complete [22]. 
Further, MEMS-based storage devices do not ex- 
hibit the power surge inherent in spinning up disk 
drives, so power spike avoidance techniques (e.g., 
serializing the spin-up of multiple disk drives) are 
unnecessary—all devices can be started simultane- 
ously. Combined, these effects could reduce system 
restart times from minutes to milliseconds. 


7 Other considerations 


This section discusses additional issues related to 
our exploration of OS management for MEMS-based 
storage devices. 


Power conservation. Significant effort has gone 
into reducing a disk drive’s power consumption, such 
as reducing active power dissipation and introduc- 
ing numerous power-saving modes for use during 
idle times [6, 15, 16]. MEMS-based storage devices
are much more energy efficient than disks in terms 
of operational power. Further, the physical char- 
acteristics of MEMS-based storage devices enable 
a simpler power management scheme: a single idle 
mode that stops the sled and powers down all non- 
essential electronics. With no rotating parts and lit- 
tle mass, the media sled’s restart time is very small 
(estimated at under 0.5ms). This relatively small 
penalty enables aggressive idle mode use, switching 
from active to idle as soon as the I/O queue is empty. 
Detailed energy breakdown and evaluation indicates 
that our default MEMS-based storage device em- 
ploying this immediate-idle scheme would dissipate 
only 8-22% of the energy used by today’s low-power
disk drives [27]. 


Alternate seek control models. Our device 
model assumes that seeks are accomplished by a 
“slew plus settle” approach, which involves maxi-
mum acceleration for the first portion of the seek,
followed by maximum deceleration to the destina- 
tion point and speed, followed by a closed loop set- 
tling time. With such seek control, the slew time 
goes up as the square root of the distance and the 
settling time is constant (to the first order). The al- 
ternate seek control approach, a linear system seek, 
would incorporate rate proportional feedback to pro- 
vide damping and a step input force to initiate move- 
ment to a desired location and velocity. Seeks based 
on such a control system exhibit longer seek times 
(including the settling times) that are much more 
dependent on seek distance [35]. This should not 
change our high-level conclusions, but will tend to 
increase the importance of both SPTF scheduling 
and subregion data layouts. 
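
The square-root dependence of slew time on distance follows from constant-acceleration kinematics: accelerating over the first half of a distance d and decelerating over the second half takes 2·sqrt(d/a) in total. The sketch below applies this to the default parameters of Table 1. It is a first-order illustration only: it ignores the spring forces and any nonzero speed at the destination, and the function names are assumptions.

```python
import math

SLED_ACCEL_M_S2 = 803.6   # sled acceleration from Table 1
SETTLE_S = 0.22e-3        # constant settling time from Table 1
BIT_PITCH_M = 40e-9       # media bit cell size from Table 1


def slew_time_s(distance_m, accel=SLED_ACCEL_M_S2):
    """Accelerate for half the distance, decelerate for the other half."""
    return 2.0 * math.sqrt(distance_m / accel)


def x_seek_time_ms(distance_bits, settle_s=SETTLE_S):
    """Slew plus constant settle for an X seek of `distance_bits` cells."""
    if distance_bits == 0:
        return 0.0  # assumption: no X movement, no settling
    return (slew_time_s(distance_bits * BIT_PITCH_M) + settle_s) * 1e3


for bits in (10, 100, 1000, 2500):
    print(bits, round(x_seek_time_ms(bits), 3))
```

Because the slew term grows only with the square root of distance while the settle term is fixed, short seeks are dominated by settling, which is why the settling time has such a strong effect on the relative value of SPTF.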


Erase cycles. Although our target MEMS-based 
storage device employs traditional rewriteable mag- 
netic media, some designs utilize media that must be 
reset before it can be overwritten. For example, the 
IBM Millipede [31] uses a probe technology based 
on atomic force microscopes (AFMs), which stores 
data by melting minute pits in a thin polymer layer. 
A prominent characteristic of the Millipede design 
is a block erase cycle requiring several seconds to 
complete. Such block erase requirements would ne- 
cessitate management schemes, like those used for 
Flash RAM devices [5], to hide erase cycle delays. 


8 Summary 


This paper compares and contrasts MEMS-based 
storage devices with disk drives and provides a foun- 
dation for focused OS management of these new de- 
vices. We describe and evaluate approaches for tun- 
ing request scheduling, data placement and failure 
management techniques to the physical characteris- 
tics of MEMS-based storage. 


One of the general themes of our results is that OS 
management of MEMS-based storage devices can be 
similar to, and simpler than, management of disks. 
For example, disk scheduling algorithms can be 
adapted to MEMS-based storage devices in a fairly 
straightforward manner. Also, performance is much 
less dependent on such optimizations as careful data 
placement, which can yield order of magnitude improvements
for disk-based systems [9, 19, 24]; data
placement still matters, but sub-optimal solutions
may not be cause for alarm. In the context of availability,
internal redundancy can mask most problems,
eliminating both data loss and performance
loss consequences common to disk drives. Similarly, 
rapid restart times allow power-conservation soft- 
ware to rely on crude estimates of idle time. 


We continue to explore the use of MEMS-based stor- 
age devices in computer systems, including their 
roles in the memory hierarchy [27] and in enabling 
new applications.


Trading Capacity for Performance in a Disk Array*

Randolph Y. Wang† Kai Li‡


Abstract 


A variety of performance-enhancing techniques, 
such as striping, mirroring, and rotational data repli- 
cation, exist in the disk array literature. Given a 
fixed budget of disks, one must intelligently choose 
what combination of these techniques to employ. 
In this paper, we present a way of designing disk 
arrays that can flexibly and systematically reduce 
seek and rotational delay in a balanced manner. We 
give analytical models that can guide an array de- 
signer towards optimal configurations by considering 
both disk and workload characteristics. We have im- 
plemented a prototype disk array that incorporates 
the configuration models. In the process, we have 
also developed a robust disk head position predic- 
tion mechanism without any hardware support. The 
resulting prototype demonstrates the effectiveness of 
the configuration models. 


1 Introduction 


In this paper, we set out to answer a simple ques- 
tion: how do we systematically increase the perfor- 
mance of a disk array by adding more disks? 

This question is motivated by two phenomena. 
The first is the presence of a wide variety of 
performance-enhancing techniques in the disk array 
literature. These include striping [17], mirroring [3],
and replication of data within a track to improve rotational
delay [18]. All of these techniques share the
common theme of improving performance by scaling
the number of disks. Their performance impacts,
however, are different. Given a fixed budget of disks, 
an array designer faces the choice of what combination
of these techniques to use.

The second phenomenon is the increasing cost
and performance gap between disk and memory. 
This increase is fueled by the explosive growth of 
disk areal density, which is at an annual rate of about 
60% [8]. On the other hand, the areal density of 
memory has been improving at a rate of only 40% 
per year [8]. The result is a cost gap of roughly two 
orders of magnitude today. 

As disk latency has been improving at about 
only 10% per year [8], disks are becoming increas- 
ingly unbalanced in terms of the relationship be- 
tween capacity and latency. Although cost per byte 
and capacity per drive remain the predominant con- 
cerns of a large sector of the market, a substantial 
performance-sensitive (and, in particular, latency- 
sensitive) market exists. Database vendors today 
have already recognized the importance of building 
a balanced secondary storage system. For example, 
in order to achieve high performance on TPC-C [26], 
vendors configure systems based on the number of 
disk heads instead of capacity. To achieve D times 
the bandwidth, the heads form a D-way mirror, a D- 
way stripe, or a RAID-10 configuration [4, 11, 25], 
which combines mirroring and striping so that each 
unit of the striped data is also mirrored. What is 
not well understood is how to configure the heads to 
get the most out of them. 

The key contributions of this paper are: 

• a flexible strategy for configuring disk arrays and
its performance models,

• a software-only disk head position prediction
mechanism that enables a range of
position-sensitive scheduling algorithms, and

• evaluation of a range of alternative strategies
that trade capacity for performance.

More specifically, we present a disk array config- 
uration, called an SR-Array, that flexibly combines 
striping with rotational replication to reduce both 
seek and rotational delay. The power of this config- 
uration lies in that it can be flexibly adjusted in a 
balanced manner that takes a variety of parameters 
into consideration. We present a series of analyti- 
cal models that show how to configure the array by 
considering both disk and workload characteristics.

To evaluate the effectiveness of this approach, we
have designed and implemented a prototype disk array
that incorporates the SR-Array configurations.
In the process, we have developed a method for 
predicting the disk head location. It works on a 
wide range of off-the-shelf hard drives without spe- 
cial hardware support. This mechanism is not only 
a crucial ingredient in the success of the SR-Array 
configurations, it also enables the implementation of 
rotational position sensitive scheduling algorithms, 
such as Shortest Access Time First (SATF) [14, 23], 
across the disk array. Because these algorithms in- 
volve inter-disk replicas, without the head-tracking 
mechanism, it would have been difficult to choose 
replicas intelligently even if the drives themselves 
perform sophisticated internal scheduling. 

Our experimental results demonstrate that the 
SR-Array provides an effective way of trading capacity
for improved performance. For example, under
one file system workload, a properly configured six-disk
SR-Array delivers 1.23 to 1.42 times lower latency
than that achieved on highly optimized striping
and mirroring systems. The same SR-Array
achieves 1.3 to 2.6 times better sustainable throughput
while maintaining a 15 ms response time on this
workload.

The remainder of the paper is organized as fol- 
lows. Section 2 presents the SR-Array analytical 
models that guide configuration of disk arrays. Sec- 
tion 3 describes the integrated simulator and proto- 
type disk array that implement the SR-Array config- 
uration models. Section 4 details the experimental 
results. Section 5 describes some of the related work. 
Section 6 concludes. 


2 Techniques and Analytical Models 


In this section, we provide a systematic analysis 
of how a combination of the performance-enhancing 
techniques such as striping and data replication 
can contribute to seek distance reduction, rotational 
delay reduction, overall latency improvement, and 
throughput improvement. These analytical models, 
though approximations in some cases, serve as a ba- 
sis for configuring a disk array for a given workload. 


2.1 Reducing Seek Distance 


We start by defining the following abstract prob- 
lem: suppose the maximum seek distance on a single 
disk is S, the total amount of data fits on a single 
disk, and accesses are uniformly distributed across 
the data set. Then, how can we effectively employ 
D disks to reduce the average seek latency? We use 
seek distance to simplify our presentation. (Seek
latency is approximately a linear function of seek
distance only for long seeks [22].) As a base case, one
can show that the average seek distance for reads on
a single disk [24] is $S_1 = S/3$.

Figure 1: Techniques for reducing seek distance. Capital
letters represent a portion of the data. To the left of the
arrows, we show how data is (logically) stored on a single
disk. To the right, we show different ways that data on
the single disk can be distributed on two disks (D = 2):
(a) D-way mirroring, and (b) D-way striping.

The first seek reduction technique is D-way mir- 
roring (shown in Figure 1(a)). D-way mirroring can 
reduce seek distance because we can choose the disk 
head that is closest to the target sector in terms of 
seek distance. With D disks, the average seek dis- 
tance is the average of the minimum of D random 
variables [3], which is S/(2D +1). 

The second technique is striping (and keeping 
disks partially empty). Figure 1(b) illustrates a two- 
way striping. Data on the original single disk is par- 
titioned into two disjoint sets: B and C. We store B 
on the outer edge of the first disk and C on the outer 
edge of the second disk. The space in the middle of 
these two disks is not used. In this case, the sin- 
gle large disk is in effect split into two smaller disks. 
As a result, the disk head movement is restricted to 
a smaller region. Assuming constant track capacity 
and uniform accesses, Matloff [17] gives the average 
seek distance of a D-way stripe ($S_s$):

$$S_s(D) = \frac{S}{3D} \qquad (1)$$

The amount of seek reduction achieved by striping is
better than that of D-way mirroring. However, D- 
way mirroring provides reliability through the use 
of multiple copies. A hybrid scheme would provide 
reliability along with smaller seek latencies. RAID-10,
widely used in practice, is a concrete example of
such a hybrid scheme: in a RAID-10 system, data
is striped across $D_s$ disks while each block is also
replicated on $D_m$ different disks.
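The three seek-distance expressions quoted in this section can be compared directly. The sketch below simply evaluates them; the function names are hypothetical, and the formulas are the ones cited above ($S/3$ for a single disk, $S/(2D+1)$ for a D-way mirror, and $S/(3D)$ for a D-way stripe).

```python
def seek_single(S):
    """Average seek distance for reads on a single disk."""
    return S / 3.0


def seek_mirror(S, D):
    """Average seek distance for reads on a D-way mirror."""
    return S / (2.0 * D + 1.0)


def seek_stripe(S, D):
    """Average seek distance on a D-way stripe (Equation (1))."""
    return S / (3.0 * D)


def compare(S=1.0, disks=(1, 2, 4, 8)):
    """Return (D, mirror, stripe) rows for a quick side-by-side look."""
    return [(D, seek_mirror(S, D), seek_stripe(S, D)) for D in disks]
```

Both expressions reduce to $S/3$ at D = 1, and since $3D > 2D + 1$ whenever D > 1, the stripe's average seek distance is the smaller of the two for any array of two or more disks.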


2.2 Reducing Rotational Delay 


As we reduce the average seek distance, the rotational
delay starts to dominate the disk access cost.
To address this limitation, we replicate data at different
rotational positions, and by choosing a replica
that is rotationally closest to the disk head, we can
reduce rotational delay. Replication for reducing rotational
delay can increase seek distance by pushing
data farther apart. We will discuss combining the
techniques for reducing seek and rotation distance
in a later section.

Figure 2: Techniques for reducing rotational delay. (a)
Randomly placed replicas. (b) Evenly spaced replicas. (c)
Replicas placed on different tracks (either within a single
disk or on different disks).

If the time needed to complete a rotation on a
single disk is R, we observe that the average rotational
delay $R_r(1)$ is simply half of a full rotation,
i.e., $R_r(1) = R/2$. If we replicate data D times, and
spread the replicas evenly on a track (i.e., 360/D degrees
apart from each other as shown in Figure 2(b)),
the average read rotational latency $R_r$ is:

$$R_r(D) = \frac{R}{2D} \qquad (2)$$


We can also show that the average read rotational
latency is $R_r = R/(D+1)$, if we randomly place
replicas (shown in Figure 2(a)) on the same track.
This technique is therefore less beneficial than evenly
distributing the replicas and is not used in our design.
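The two placement policies can be checked numerically. The following is a small Monte Carlo sketch, not part of the prototype: it assumes the platter rotates in one direction, that the head angle is uniformly distributed when a request arrives, and that the delay is the time until the nearest replica passes under the head. The function name and its parameters are hypothetical.

```python
import random


def mean_rotational_delay(R, D, evenly_spaced, trials=100_000, seed=0):
    """Estimate the average rotational delay with D replicas on one track.

    R is the full rotation time. Positions are expressed as times in [0, R).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        if evenly_spaced:
            # Replicas R/D apart, starting at a random offset.
            offset = rng.random() * R
            replicas = [(offset + k * R / D) % R for k in range(D)]
        else:
            # Each replica placed independently at random.
            replicas = [rng.random() * R for _ in range(D)]
        head = rng.random() * R
        # Wait until the first replica rotates under the head.
        total += min((r - head) % R for r in replicas)
    return total / trials
```

With R = 1 and D = 3, the estimates should land near 1/6 for evenly spaced replicas and near 1/4 for random placement, matching $R/(2D)$ and $R/(D+1)$ respectively.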

However, having multiple replicas on one track
increases the average rotational latency $R_w$ for writing
all these replicas to:

$$R_w(D) = R - \frac{R}{2D} \qquad (3)$$

Of course, we could reduce the write costs by 
writing the closest copy synchronously and prop- 
agating other copies during idle periods. Equa- 
tion (3) gives the worst case cost when we are not 
able to mask the replica propagation. Notice that 
$R_r(D) + R_w(D) = R$. Thus if reads are more frequent
than writes, making more replicas will reduce
overall latency. If reads and writes are equally frequent,
varying D will not change the average overall
latency. If writes are more frequent than reads,
the approach with no replication is always the best. 
Note that this relationship is independent of the 
value of R and is only true for foreground replica 
propagation. Background propagation may make 
replication desirable even when writes outnumber 
reads. 
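The read/write trade-off described above follows directly from Equations (2) and (3). The sketch below evaluates the average rotational delay for a given read fraction under foreground propagation; the function names and the `read_fraction` parameter are hypothetical names introduced for illustration.

```python
def rot_read(R, D):
    """Average read rotational delay with D evenly spaced replicas (Eq. 2)."""
    return R / (2.0 * D)


def rot_write(R, D):
    """Worst-case rotational delay for writing all D replicas (Eq. 3)."""
    return R - R / (2.0 * D)


def rot_mixed(R, D, read_fraction):
    """Average rotational delay for a mix of reads and foreground writes."""
    return read_fraction * rot_read(R, D) + (1.0 - read_fraction) * rot_write(R, D)
```

Expanding `rot_mixed` gives $(1 - f)R + (2f - 1)R/(2D)$ for read fraction f, so the dependence on D vanishes at f = 1/2, more replicas help when f > 1/2, and they hurt when f < 1/2.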


Figures 2(a) and (b) illustrate the concept of 
rotational replication by making copies within the 
same track. Unfortunately, this decreases the bandwidth
of large I/O as a result of shortening the effective
track length and increasing track switch frequency.
To avoid unnecessary track switches, we
place the replicas on different tracks either within a 
cylinder of a single disk or on different disks (shown 
in Figure 2(c)). Track skews must be re-arranged so 
that large sequential I/Os that cross track bound- 
aries do not suffer any unnecessary degradation. 


2.3 Reducing Both Seek and Rotational 
Delay 


In the previous sections, we have discussed ex- 
isting techniques for reducing seek distance and ro- 
tational delay in isolation. Their combined effects, 
however, are not well understood. We now develop 
models that predict the overall latency as we increase 
the number of disks. 


SR-Array: Combining Striping and Rota- 
tional Replication 


Since disk striping reduces seek distance and rota- 
tional replication reduces rotational delay, we can 
combine the two techniques to reduce overall la- 
tency. We call the resulting configuration an SR- 
Array. Figure 3 shows an example SR-Array. In 
an SR-Array, we perform rotational replication on 
the same disk. We explore rotational replication on 
different disks in a later section. 

Given a fixed budget of D disks, we would now 
like to answer the following question: what degree 
of striping and what degree of rotational replication 
should we use for the best resulting performance? 
We call this the “aspect ratio” question. We first 
consider this question for random access latency, and 
then we examine how the model can be extended to 
take into account other workload parameters. 


Read Latency on an SR-Array 


In this paper, we define overhead to include various 
processing times, transfer costs, track switch time, 
and mechanical acceleration/deceleration times. We 
focus on the overhead-independent part of the la- 
tency in the following analysis. 

Let us assume that we have a single disk’s worth
of data, and we have a total of D disks. Suppose
the maximum seek time on one disk is S, the time
for a full rotation is R, only $1/D_s$ of the cylinders
on a single disk are used to limit the seek distance,
and $D_r$ is the number of replicas for reducing
rotational delay ($D_s D_r = D$). If $D_r = 1$, an SR-Array
degenerates to simple striping and only 1/D of the
available space is used. If $D_s = 1$, we use all the
available space. In Figure 3, $D_s = 3$ and $D_r = 2$.

Figure 3: Reducing both seek and rotational delay in an
SR-Array. (a) Data on an original disk is divided into
six parts. (b) A 3×2 SR-Array. Each disk holds only
one sixth of the original data. The two rotational replicas
for each block ensure that the maximum rotational delay
for any data is half of a full rotation. (Two times the
number of disks are needed to support two-way rotational
replication; this is shown in the vertical dimension.) The
rotational replicas expand the seek distance between different
data blocks so the maximum seek distance on each of
these six disks is the same as that in a simple three-way
striped system (denoted by the three disks in the horizontal
dimension).

Because the random read latency is the sum
of the overhead, the average seek time, and the
average rotational time, we can approximate the
overhead-independent part of random read latency
$T_R(D_s, D_r)$ as:

$$T_R(D_s, D_r) = \frac{S}{3D_s} + \frac{R}{2D_r} \qquad (4)$$

Given the constraint of $D_s D_r = D$, we can prove
that the following configuration produces the best
overall latency for independent random I/Os under
low load:

$$D_s = \sqrt{\frac{2S}{3R}D}, \qquad D_r = \sqrt{\frac{3R}{2S}D} \qquad (5)$$

The overhead-independent part of latency under this
configuration is therefore:

$$T_{best} = \sqrt{\frac{2SR}{3D}} \qquad (6)$$

It is likely that the optimal $D_s$ and $D_r$ are not
integer values. For such scenarios, we choose $D_r$ to be
the maximum integer factor of D that is less than
or equal to the optimal non-integer value.

Disks with slow rotational speed (large R) demand
a higher degree of rotational replication. In
terms of the SR-Array illustration of Figure 3(b),
this argues for a tall thin grid. Conversely, disks with
poor seek characteristics (large S) demand a large
striping factor. In terms of Figure 3(b), this argues
for a short fat grid. The model indicates that the latency
improvement on an SR-Array is proportional
to the square root of the number of disks ($\sqrt{D}$).

So far, our discussion of the model applies to ran- 
dom access by assuming an average seek of S/3 in 
Equation (6). To capture seek locality, we replace 
S/3 with the average seek of a workload. In the 
later experimental sections, this is accomplished by 
dividing S/3 by a “seek locality index” (L), which
is observed from the workload. The model does not 
directly account for sequential access. 
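The aspect-ratio calculation can be sketched in a few lines. The code below minimizes $S/(3 L D_s) + R/(2 D_r)$ subject to $D_s D_r = D$, then rounds to integers. It assumes the rounding rule is applied to the replication degree $D_r$, and the function names and the example values (S = 12 ms, R = 6 ms, D = 6) are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def optimal_aspect_ratio(S, R, D, L=1.0):
    """Continuous optimum (Ds, Dr) for random reads.

    L is the seek locality index; L = 1 means uniformly random access,
    i.e. an average seek of S/3.
    """
    seek = S / L  # effective seek parameter after applying locality
    ds = math.sqrt(2.0 * seek * D / (3.0 * R))
    dr = math.sqrt(3.0 * R * D / (2.0 * seek))
    return ds, dr


def read_latency(S, R, ds, dr, L=1.0):
    """Overhead-independent read latency for a given Ds x Dr configuration."""
    return S / (3.0 * L * ds) + R / (2.0 * dr)


def integer_config(S, R, D, L=1.0):
    """Round to an integer grid: Dr is the largest factor of D that does
    not exceed its continuous optimum (assumed reading of the rule)."""
    _, dr_opt = optimal_aspect_ratio(S, R, D, L)
    candidates = [f for f in range(1, D + 1) if D % f == 0 and f <= dr_opt]
    dr = max(candidates, default=1)
    return D // dr, dr
```

With these illustrative numbers, `integer_config(12.0, 6.0, 6)` yields a 3×2 grid, while raising the locality index to L = 4 shifts the result to 2×3: shorter average seeks leave rotational delay as the larger share of latency, so more disks go to replication.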


Read/Write Latency on an SR-Array 


Now we extend the latency model of an SR-Array 
to model the performance of both read and write 
operations. When performing a write, in the worst 
case scenario of not being able to mask the cost of 
replica propagation, we must incur a write latency 
of $T_W(D_s, D_r)$:

$$T_W(D_s, D_r) = \frac{S}{3D_s} + R - \frac{R}{2D_r} \qquad (7)$$

Let the number of reads be $X_r$, the number of
writes that can be propagated in the background be
$X_{wb}$, and the number of writes that are propagated
in the foreground be $X_{wf}$. We define the ratio p:

$$p = \frac{X_r + X_{wb}}{X_r + X_{wb} + X_{wf}} \qquad (8)$$

The average read/write latency, $T(D_s, D_r) = p\,T_R + (1 - p)\,T_W$,
can be expressed as:

$$T(D_s, D_r) = \frac{S}{3D_s} + p\,\frac{R}{2D_r} + (1 - p)\left(R - \frac{R}{2D_r}\right) \qquad (9)$$





The first term is the average seek incurred by any 
request. The second term is the average rotational 
delay consumed by I/O operations that do not result 
in foreground replica propagation (based on Equa- 
tion (2)) with probability p; and the third term is 
the rotational delay consumed by writes whose repli- 
cas are propagated in the foreground due to lack of 
idle time (based on Equation (3)) with probability 
$1 - p$. We can prove that the following configuration
provides the best overall latency:

$$D_s = \sqrt{\frac{2S}{3R(2p - 1)}D}, \qquad D_r = \sqrt{\frac{3R(2p - 1)}{2S}D} \qquad (10)$$

The latency under this configuration is:

$$T_{best} = \sqrt{\frac{2SR(2p - 1)}{3D}} + (1 - p)R \qquad (11)$$

A low p ratio calls for a short fat grid in Figure 3(b).
A p ratio under 50% precludes rotational
replication and pure striping provides the best configuration.
In the best case, when all write replicas
can be propagated in the background (or when we
have no writes at all), writes and reads become indistinguishable
as far as this model is concerned, so
p approaches 1 and the latency improvement is
proportional to $\sqrt{D}$.


2.4 Scheduling and Throughput


We now consider throughput improvements and address
the following questions: 1) How do we schedule
the requests to take advantage of the additional resources?
2) How do we modify the SR-Array aspect
ratio models to optimize for throughput?


Scheduling on an SR-Array 


In our SR-Array design, we choose to place a block 
and all its replicas (if any) on a single disk. Re- 
quests are sent to the only disk responsible for the 
data, which queues requests and performs schedul- 
ing on each disk locally and independently. In con- 
trast, in a mirrored system, because any request can 
be scheduled for any copy, devising a good global 
scheduler is non-trivial. We report heuristics-based 
results for mirrored systems in later sections. In this 
section, we focus on scheduling for an SR-Array and 
develop an extension of the LOOK algorithm for an 
SR-Array, which we call RLOOK. 

Under the traditional LOOK algorithm, the disk 
head moves bi-directionally from one end of the disk 
to another, servicing requests that can be satisfied 
by the cylinder under the head. On an SR-Array 
disk, in addition to scanning the disk like LOOK 
in the seek direction, our RLOOK scheduling also 
chooses the replica that is rotationally closest among 
all the replicas during the scan. 
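The replica choice made by RLOOK during a scan can be illustrated with a simplified sketch. The code below covers only a single upward stroke and makes several assumptions that are not taken from the prototype: seek time is linear in cylinder distance, transfer time is ignored, and replica positions are given as fractions of a rotation. The `Request` type and `rlook_stroke` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Request:
    cylinder: int
    replica_angles: List[float]  # each in [0, 1), as a fraction of a rotation


def rlook_stroke(requests: List[Request], head_cylinder: int, start_time: float,
                 R: float, seek_per_cylinder: float) -> List[Tuple[int, float, float]]:
    """Serve one upward stroke; return (cylinder, chosen_angle, finish_time)."""
    schedule = []
    t = start_time
    cyl = head_cylinder
    # LOOK part: visit pending requests in cylinder order along the stroke.
    pending = sorted((r for r in requests if r.cylinder >= head_cylinder),
                     key=lambda r: r.cylinder)
    for req in pending:
        t += seek_per_cylinder * (req.cylinder - cyl)
        cyl = req.cylinder
        # Platter angle under the head once the seek completes.
        head_angle = (t / R) % 1.0
        # Rotational part: pick the replica that arrives under the head first.
        wait, angle = min(((a - head_angle) % 1.0, a) for a in req.replica_angles)
        t += wait * R
        schedule.append((req.cylinder, angle, t))
    return schedule
```

A full scheduler would alternate stroke directions and account for transfer and track-switch time; the sketch isolates only the step that distinguishes RLOOK from LOOK, which is the per-request choice of the rotationally closest replica.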

Suppose q is the number of requests to be scheduled
for a single RLOOK stroke on a single disk, and
S, R, D, $D_s$, $D_r$, and p retain their former definitions
from Section 2.3. The average time of a single
request in the stroke is then $T(D_s, D_r)$:

$$T(D_s, D_r) = \frac{S}{qD_s} + p\,\frac{R}{2D_r} + (1 - p)\left(R - \frac{R}{2D_r}\right) \qquad (12)$$

The first term amortizes the end-to-end seek time,
which is an approximation of the time needed for a
LOOK stroke, over the q requests. The two remaining terms
are identical to those of Equation (9). (Empirically,
this is a good approximation when q > 3. When q <
3, the requests are so sparse that the latency models
of Equations (9) through (11) are used instead.)
Starting with Equation (12), we can prove that
the best latency is achieved with the following
configuration:

$$D_s = \sqrt{\frac{2S}{qR(2p - 1)}D}, \qquad D_r = \sqrt{\frac{qR(2p - 1)}{2S}D} \qquad (13)$$

Under this configuration, the average request latency
of RLOOK is:

$$T_{best} = \sqrt{\frac{2SR(2p - 1)}{qD}} + (1 - p)R \qquad (14)$$


Assuming that each request has an overhead of $T_o$,
we can approximate the single disk throughput by

$$N_1 \approx \frac{1}{T_o + T_{best}} \qquad (15)$$

In addition to the parameters that we have seen
in the previous models, the aspect ratio is now also
sensitive to q, a measure of the busyness of the system.
A long queue allows for the amortization of the
end-to-end seek over many requests; consequently,
we should devote more disks to reducing rotational
delay. In terms of the SR-Array illustration of Figure 3(b),
this argues for a tall thin grid. As with
the model in the last section, a p ratio under 50%
also precludes rotational replication; pure striping is
best and Equations (13) through (15) do not apply.
In the best case, when all replicas are propagated in
the background, p approaches 1, and the model suggests
that the overhead-independent part of service
time also improves proportionally to $\sqrt{D}$.

Having modeled the throughput of a single disk,
we attempt to model the throughput of an SR-Array
with D disks and a total of $Q = Dq$ outstanding
requests, where q is the average queue size per
disk. We assume that the requests are randomly distributed
in the system. There could be a load imbalance
in the form of idle disks when Q is not much
more than D. The probability of one disk being idle
is $(1 - 1/D)^Q$. Therefore, the total throughput of the
system is:

$$N_D \approx D\left[1 - \left(1 - \frac{1}{D}\right)^Q\right] N_1 \qquad (16)$$




Although this approximation is derived based on 
reasoning about the presence of idle disks, we shall 
see in Section 4.2 that it is in fact a good approxi- 
mation of more general cases. 
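The throughput model of Equations (12) through (16) can be put together as follows. This is a sketch under the model's own assumptions (foreground-propagation ratio p above 0.5, requests spread randomly over the disks); the function names and the `overhead` parameter, which stands for the per-request overhead, are hypothetical. All times must share one unit, and the result is in requests per that unit.

```python
import math


def rlook_best_config(S, R, D, q, p):
    """Continuous optimum (Ds, Dr) and the resulting latency (Eqs. 13, 14)."""
    if p <= 0.5:
        raise ValueError("model applies only for p > 0.5; use pure striping")
    k = q * R * (2.0 * p - 1.0)
    ds = math.sqrt(2.0 * S * D / k)
    dr = math.sqrt(k * D / (2.0 * S))
    t_best = math.sqrt(2.0 * S * R * (2.0 * p - 1.0) / (q * D)) + (1.0 - p) * R
    return ds, dr, t_best


def rlook_request_time(S, R, ds, dr, q, p):
    """Average per-request time in one RLOOK stroke (Eq. 12)."""
    return S / (q * ds) + p * R / (2.0 * dr) + (1.0 - p) * (R - R / (2.0 * dr))


def array_throughput(S, R, D, Q, p, overhead):
    """Approximate array throughput with Q outstanding requests (Eq. 16)."""
    q = Q / D  # average queue length per disk
    _, _, t_best = rlook_best_config(S, R, D, q, p)
    n1 = 1.0 / (overhead + t_best)             # single-disk throughput (Eq. 15)
    busy = 1.0 - (1.0 - 1.0 / D) ** Q          # chance that a given disk is not idle
    return D * busy * n1
```

Substituting the configuration from `rlook_best_config` into `rlook_request_time` reproduces `t_best`, which is a convenient consistency check on Equations (12) through (14).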

Now that we have described the RLOOK exten- 
sion to LOOK, it is easy to understand a similar 
extension to SATF: RSATF. An RSATF scheduler 
chooses the next request with the shortest access 
time by considering all rotational replicas. It is well 
known that SATF outperforms LOOK [14, 23] by 
considering rotational delay. Our experimental re- 
sults will show that the gap between RLOOK and 
RSATF is smaller because both scheduling algo- 
rithms consider rotational delays. Once the detailed 
low level disk layout is understood, RLOOK is sim- 
ple to implement; it is an attractive local scheduler 
for an SR-Array. 


2.5 Comparing SR-Array with Striped 
Mirror 


In an SR-Array, all replicas exist on the same 
disk. Removing this restriction, we can place these 
replicas at rotationally even positions on different 
disks in a “synchronized” mirror, a mirrored system 
whose spindles are synchronized. We call this layout 
strategy a striped mirror, one flavor of the RAID-10 
systems known in the disk array industry. (RAID-10 
is a broader term that typically does not necessar- 
ily imply the requirement of synchronized spindles 
and the placement of replicas at rotationally even 
positions.) To make the performance of the striped 
mirror competitive to a corresponding SR-Array, we 
must choose replicas intelligently based on rotational 
positioning information. 

Even with these assumptions, a striped mirror is 
not equivalent to an SR-Array counterpart. Con- 
sider a simple example involving only two disks: 
blocks A and B reside on different disks in an SR- 
Array; but each of the disks in a corresponding 
striped mirror has both blocks. Now consider a ref- 
erence stream of AAB. On an SR-Array, the two 
accesses to A are satisfied by two rotational repli- 
cas on one disk, consuming less than a full rotation 
in terms of rotational delay; and the access to B is 
satisfied by a different disk so its access time is inde- 
pendent of the first disk and the first two accesses. 
In an attempt to emulate the behavior of the SR- 
Array, we must send the two accesses to A to the 
two replicas on different disks in a striped mirror; 
but now the access time of B is affected by the first 
two accesses because both disks are busy. In general, 
it is impossible to enforce identical individual access 
time for a stream of requests to an SR-Array and a 
striped mirror.

Statistically, the read latency of a striped mirror
should be slightly better than the latency on a cor- 
responding SR-Array. This is because the average 
of the minimum of the sum of seek and rotational 
delay is smaller than the sum of the average seek 
and average minimum rotational delay. 

In terms of throughput, the simple example above 
shows that for an arbitrary stream of requests, there 
does not exist a general schedule on a striped mirror 
that is equivalent to that on a corresponding SR- 
Array. We do not pretend to know how to optimally 
choose replicas on a striped mirror. Section 3 dis- 
cusses a number of heuristics. The performance of 
our best effort implementation of a striped mirror 
has failed to match that of an SR-Array counter- 
part. 

In terms of feasibility, as spindle synchronization 
is becoming a rarer feature on modern drives, one 
can only approximate striped mirrors on unsynchro- 
nized spindles. In terms of reliability, a striped mir- 
ror is obviously better than an SR-Array. 

In general, it is possible to combine an SR-Array 
with a striped mirror to achieve the benefits of both 
approaches so that some of the replicas are on the 
same disk and some are on different ones. The result 
is the most general configuration: a $D_s \times D_r \times D_m$
“SR-Mirror”, where $D_s$ implies that only $1/D_s$ of
the space is used (to reduce seek time), $D_r$ is the
number of replicas on the same disk, and $D_m$ is the
number of replicas on different disks. A $D \times 1 \times 1$
system is D-way striping. A $1 \times 1 \times D$ system is a
D-way mirror. A $D_s \times D_r \times 1$ system is an SR-Array.
A $D_s \times 1 \times 2$ system is the most common RAID-10
configuration. We may approximate the performance of an
SR-Mirror by replacing $D_r$ in the SR-Array models
with $D_r \times D_m$.
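The special cases of the three-dimensional configuration can be summarized in code. The sketch below is illustrative only: the function names and the label strings are hypothetical, and the latency function applies the approximation just described by substituting $D_r \times D_m$ for $D_r$ in Equation (4).

```python
def classify_sr_mirror(ds, dr, dm):
    """Name the layout described by a Ds x Dr x Dm configuration."""
    if min(ds, dr, dm) < 1:
        raise ValueError("all three dimensions must be at least 1")
    if ds == 1 and dr == 1 and dm == 1:
        return "single disk"
    if dr == 1 and dm == 1:
        return "D-way striping"
    if ds == 1 and dr == 1:
        return "D-way mirror"
    if dm == 1:
        return "SR-Array"
    if dr == 1:
        return "striped mirror (RAID-10 style)"
    return "SR-Mirror"


def sr_mirror_read_latency(S, R, ds, dr, dm):
    """Approximate overhead-independent read latency of an SR-Mirror."""
    return S / (3.0 * ds) + R / (2.0 * dr * dm)
```

Under this approximation a 3×1×2 striped mirror and a 3×2×1 SR-Array have the same modeled read latency; the differences discussed in this section (scheduling, spindle synchronization, reliability) are outside what the formula captures.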


2.6 Summary of Techniques and Models 


In this section, we have explored how by scaling 
the number of disks in a storage system we can 1) 
reduce seek distance, 2) reduce rotational delay, 3) 
reduce overall latency by combining these techniques 
in a balanced manner, and 4) improve throughput. 
To achieve these goals, the storage system needs to 
be configured based on a number of parameters. We 
have developed simple models that capture the fol- 
lowing parameters that influence the configuration 
decisions: 

• disk characteristics in the form of seek and
rotational characteristics (S and R),

• read/write ratio (p),

• busyness of the system (q), and

• seek locality (L).

We note that there does not exist a single “perfect”
SR-Array configuration; instead, there may exist
one “right” configuration for every workload and
cost/performance specification. As we increase the
number of disks, and if we properly configure the
storage system, under the right conditions, the various
models of this section suggest the following rule
of thumb: By using D disks, we can improve the
overhead-independent part of response time by a
factor of $\sqrt{D}$.


3 Implementation 


In the previous section, we analyzed how to con- 
figure a disk array to deliver better performance as 
we scale the number of disks in the system. In this 
section, we describe the prototype MimdRAID im- 
plementation that puts the theory to test. 


3.1 Overview 


To show the feasibility of our approach, we have 
developed an infrastructure for experimenting with 
the various techniques. Figure 4 illustrates the sys- 
tem components and how they stack against each 
other. There are a number of ways of configuring the 
system. The controlling agent (at the top) can be 
either a user level disk driver or an OS kernel device 
driver. The devices (at the bottom) can be either 
a disk simulator or real SCSI disks. The remain- 
ing components (in the middle) are shared across all 
configurations and are built in a layered fashion. 

The SCSI Abstraction Layer abstracts device spe- 
cific operations such as SCSI device detection at 
boot time, issuing SCSI commands, and retrieval of 
command status. We currently support two different 
10000 RPM SCSI drives: Seagate ST34502LW and 
ST39133LWV; but all experimental results reported 
are based on the ST39133LWV. 

The Calibration Layer is used for calibrating the 
disk and extracting information regarding the phys- 
ical layout of the disk. It keeps track of where the 
disk head is currently located and calculates how 
much time is required to move the head from its cur- 
rent position to a target sector. Section 3.2 provides 
more details about our head-tracking technique. 

A parallel layer to the SCSI abstraction layer is 
the Simulator. We decided to integrate the simulator 
into the architecture to shorten the simulation time 
for long traces: we not only eliminate idle time, but 
also replace I/O time (which can be long) with simulated
time. The simulator also provides the flexibility
of exploring the impact of changing disk characteristics.
To faithfully simulate the behavior of
the disks that we currently use in the prototype, the
simulator receives timing information from the calibration
layer to configure itself.

Figure 4: Prototype architecture.

The Scheduling layer implements several disk 
scheduling policies. This layer maintains a 
read/write command queue for each physical disk 
and invokes a user determined policy to pick the 
next request at each scheduling step. We call these 
queues the drive queues. Sections 3.3 and 3.4 pro- 
vide details of the scheduling policies. 

The Disk Configuration Layer provides support 
for configuring a collection of disks using techniques 
such as D-way mirroring, striping, SR-Array, and 
SR-Mirror. It translates I/O requests for a logical 
disk to a set of I/O commands on the physical disks 
and inserts them into the appropriate drive queues. 
The striping unit is 64K bytes in our experiments. 
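
As an illustration of the translation step, the following Python sketch maps a logical byte range onto per-disk pieces for plain round-robin striping with a 64 KB unit. The function name and the tuple layout are hypothetical, and the sketch covers only simple striping, not the SR-Array or SR-Mirror layouts.

```python
STRIPE_UNIT = 64 * 1024  # striping unit used in the experiments, in bytes


def stripe_map(offset, length, num_disks, unit=STRIPE_UNIT):
    """Split a logical byte range into (disk, disk_offset, length) pieces.

    Assumes conventional round-robin striping: stripe unit i lives on
    disk i % num_disks.
    """
    pieces = []
    end = offset + length
    while offset < end:
        stripe = offset // unit          # index of the stripe unit
        within = offset % unit           # position inside that unit
        n = min(unit - within, end - offset)
        disk = stripe % num_disks
        disk_offset = (stripe // num_disks) * unit + within
        pieces.append((disk, disk_offset, n))
        offset += n
    return pieces
```

A request that crosses a unit boundary becomes one command per disk touched; each command would then be inserted into that disk's drive queue.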

The topmost Logical Disk Layer is in charge of exposing the logical
disk to the application. The kernel device driver exposes a mount
point (e.g., drive letter Z: in Windows 2000) to user space. The
user-level driver exposes the disk in the form of an API to the
application.


3.2 Predicting Disk Head Position 


The techniques presented in Section 2 rely on 
the driver’s ability to accurately predict the disk 
head location and the cost of disk operations such 
as track switches, seeks, and rotational placement. 
The driver also needs information on the layout of 
the physical sectors on the disk. 

Previous proposals that depend on the knowledge of head positions
have relied on hardware support [5, 27]. Unfortunately, this level
of support is not always available on commodity drives. We have
developed a software-only head-tracking method.
Our scheme requires issuing read accesses to a fixed 
reference sector at periodic intervals. The head 
tracking algorithm computes the disk head position 
based on the time stamp taken immediately after 
the completion of the most recent read operation 
of the reference sector. The basic idea is that the 
time between two read accesses of the reference sec- 
tor is always an integral multiple of the full rota- 
tion time plus an unpredictable OS and SCSI over- 
head. By gradually increasing the time interval be- 
tween adjacent read requests to the reference sec- 
tor, we amortize the overhead of reading the ref- 
erence sector. Our experiments show that periodic 
re-calibration at an interval of two minutes yields 
predictions that have an error of only 1% of a full 
rotation time with 98% confidence. It is encourag- 
ing that we can achieve a high degree of accuracy 
with a low overhead associated with reading the ref- 
erence sector every two minutes. To further reduce 
this overhead, we can exploit the timing information 
and known disk head location at the end of a request. 
We have not implemented this optimization. 
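
The core of the prediction can be sketched in a few lines of Python. This is a minimal illustration rather than the prototype's code: the function and parameter names are hypothetical, it assumes a single track with evenly spaced sectors, and it ignores zones, track skew, and the OS and SCSI overhead discussed above.

```python
# One revolution of a 10000 RPM disk, in microseconds.
ROTATION_US = 60_000_000 // 10_000


def predict_head_sector(now_us, ref_done_us, sector_at_ref, sectors_per_track,
                        rotation_us=ROTATION_US):
    """Predict which sector is under the head at time now_us.

    ref_done_us   -- time stamp taken right after the most recent read of
                     the reference sector completed
    sector_at_ref -- sector index under the head at that instant
                     (assumed known from the disk layout)
    """
    # Whole rotations since the reference read leave the angle unchanged,
    # so only the remainder of the elapsed time matters.
    elapsed_in_rotation = (now_us - ref_done_us) % rotation_us
    sectors_passed = elapsed_in_rotation * sectors_per_track // rotation_us
    return (sector_at_ref + sectors_passed) % sectors_per_track
```

Because the result depends only on the elapsed time modulo the rotation time, any drift in the assumed rotation time accumulates between calibrations, which is why the reference sector has to be re-read periodically.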

To obtain an accurate image of the disk, we use 
methods that are similar to those used by Wor- 
thington [29] for determining the logical to physi- 
cal sector mapping. We obtain information on disk 
zones, track skew, bad sectors, and reserved sectors 
through a sequence of low-level disk operations. 

The last piece of information that we measure 
is the cost of performing track switches and seeks. 
Small errors in these timing measurements may in- 
troduce a penalty that is close to a full rotation. To 
reduce the number of rotation misses, we introduce 
a slack of k sectors so that when the mechanism pre- 
dicts the head to be less than k sectors away from the 
target, the scheduler conservatively chooses the next 
rotational replica after the target. This slack can be 
adjusted by a real time feedback loop to ensure that 
more than 99% of the requests are on target. 
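
The slack rule can be illustrated as follows. The Python sketch below is an assumption-laden example, not the scheduler's actual code: `choose_replica` and `adjust_slack` are hypothetical names, all replicas are treated as if they were on one track, and the one-sector-per-step feedback policy is invented purely for illustration.

```python
def choose_replica(head_sector, replica_sectors, sectors_per_track, slack):
    """Pick the rotational replica to read, skipping ones that are too close."""
    def distance(sector):
        # Sectors the platter must still rotate before `sector` arrives.
        return (sector - head_sector) % sectors_per_track

    # Replicas fewer than `slack` sectors ahead risk a full-rotation miss.
    safe = [s for s in replica_sectors if distance(s) >= slack]
    if safe:
        return min(safe, key=distance)
    # Nothing is safe: fall back to the replica with the most margin.
    return max(replica_sectors, key=distance)


def adjust_slack(slack, misses, total, target_miss_rate=0.01):
    """Hypothetical feedback step: widen the slack if too many requests miss."""
    if total and misses / total > target_miss_rate:
        return slack + 1
    return max(0, slack - 1)
```

With replicas at sectors 0 and 100 on a 200-sector track and a slack of 5, a head predicted at sector 98 skips the replica at 100 and waits for the one at 0.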


3.3 Scheduling Reads 


The head-tracking mechanism, along with accu- 
rate models of the disk layout and the seek profile, al- 
lows us to implement sophisticated local scheduling 
policies on individual disks; these include RLOOK, 
SATF, and RSATF. 

Scheduling on a mirrored system, however, is 
more complex due to the fact that a request can be 
serviced by any one of the disks that have the data. 
We use the following heuristic scheduling algorithm. 
When a read request arrives, if some of the disks 
that contain the data are idle, the scheduler imme- 
diately sends the request to the idle disk head that is 
closest to a copy of the data. If all disks that contain
the desired data are busy, the logical disk layer du- 
plicates the request and inserts the copies into the 
drive queues of all these disks. As soon as such a 
request is scheduled on one disk, all other duplicate 
requests are removed from all other drive queues. 
When a disk completes processing a request, its lo- 
cal scheduler greedily chooses the “nearest” request 
from its own drive queue. Although this heuristic 
algorithm may not be optimal, it can avoid load im- 
balance and works fairly well in practice. 
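
The heuristic can be sketched in Python as below. The `Disk` class, the function names, and the `cost(disk, request)` callback (standing in for the predicted positioning time) are all hypothetical; the sketch only mirrors the steps described above.

```python
class Disk:
    def __init__(self, name):
        self.name = name
        self.busy = False
        self.queue = []          # drive queue of pending request ids


def submit_read(request, replicas, cost):
    """Dispatch a read; return the disk that starts it, or None if queued."""
    idle = [d for d in replicas if not d.busy]
    if idle:
        # Send the request to the idle disk whose head is closest to a copy.
        best = min(idle, key=lambda d: cost(d, request))
        best.busy = True
        return best
    # All replicas are busy: place a duplicate in every drive queue.
    for d in replicas:
        d.queue.append(request)
    return None


def on_complete(disk, all_disks, cost):
    """Called when `disk` finishes a request; return the next one, if any."""
    disk.busy = False
    if not disk.queue:
        return None
    # Greedily pick the "nearest" request in this disk's own queue.
    nearest = min(disk.queue, key=lambda r: cost(disk, r))
    # Scheduling a duplicate on one disk cancels it everywhere else.
    for d in all_disks:
        if nearest in d.queue:
            d.queue.remove(nearest)
    disk.busy = True
    return nearest
```

Duplicating a request into every candidate queue defers the choice of disk until one of them is actually free, which is what lets the heuristic avoid load imbalance.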


3.4 Delayed Writes 


While multiple copies of data reduce read latency, they present a
challenge for performing writes efficiently because more than one
copy needs to be written. We need to make $D_r \times D_m$ copies
for a $D_s \times D_r \times D_m$ SR-Mirror. It is, however,
feasible to propagate the copies lazily when the disks are idle.
We can issue a write to the closest copy and delay writing the
remaining copies. For back-to-back writes to the same data block,
which happen frequently for data that die young [21], we can safely
discard unfinished updates from previous writes.

In our implementation, we maintain for each disk 
a delayed write queue, which is distinct from the 
foreground request queue. When a write request ar- 
rives, we initially schedule the first write using the 
foreground request queue just as we do for reads. 
As soon as writing one of the replicas is scheduled, 
we set aside the remaining update operations for 
the other replicas in the individual delayed write 
queues. Entries from this queue are serviced when 
the foreground request queue becomes empty. De- 
layed writes require us to make a copy of the data 
because the original buffer is returned to the OS as 
soon as the first write completes. 

To provide crash recovery, the physical location of the first write
is stored in a delayed write metadata table that is kept in NVRAM.
Note that it is not necessary to store a copy of the data itself in
NVRAM: the physical location of the first write is sufficient for
completing the remaining delayed writes upon recovery, so the table
is small. When the metadata table fills up to a threshold (10,000
entries in the current implementation), we force delayed writes out
by moving them to the foreground request queue.
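
The bookkeeping described in this section can be sketched in Python as follows. This is a simplified illustration under several assumptions: the class and method names are hypothetical, an ordinary dictionary stands in for the NVRAM metadata table, and the foreground queues here hold only forced-out replica updates (ordinary reads and first writes are omitted).

```python
from collections import deque


class DelayedWrites:
    def __init__(self, num_disks, threshold=10_000):
        # Per-disk queues of (block, data) replica updates.
        self.foreground = [deque() for _ in range(num_disks)]
        self.delayed = [deque() for _ in range(num_disks)]
        # block -> (disk of first write, set of disks still to update);
        # stands in for the metadata table kept in NVRAM.
        self.table = {}
        self.threshold = threshold

    def first_write_scheduled(self, block, first_disk, other_disks, data):
        # A newer write supersedes unfinished updates of the same block.
        for queue in self.foreground + self.delayed:
            for entry in [e for e in queue if e[0] == block]:
                queue.remove(entry)
        if not other_disks:
            self.table.pop(block, None)
            return
        copy = bytes(data)   # private copy; the OS gets its buffer back
        for disk in other_disks:
            self.delayed[disk].append((block, copy))
        self.table[block] = (first_disk, set(other_disks))
        if len(self.table) >= self.threshold:
            # Table is full: force delayed writes into the foreground.
            for disk, queue in enumerate(self.delayed):
                self.foreground[disk].extend(queue)
                queue.clear()

    def next_update(self, disk):
        """Pop the next replica update for an otherwise idle `disk`."""
        if self.foreground[disk]:
            queue = self.foreground[disk]
        elif self.delayed[disk]:
            queue = self.delayed[disk]
        else:
            return None
        block, data = queue.popleft()
        pending = self.table[block][1]
        pending.discard(disk)
        if not pending:
            del self.table[block]   # all replicas are now up to date
        return block, data
```

Only the location of the first write is recorded per block, which keeps each table entry small regardless of the block's contents.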


3.5 Validating the Integrated Simulator 


So far, we have described the architecture and the components of
the integrated MimdRAID simulator and device driver. To establish
1) the accuracy of the head-tracking mechanism, and 2) the validity
of the simulator, we perform a series of experiments using
“Iometer”, a benchmark developed by the Intel Server Architecture
Lab [13]. Iometer can generate workloads of various
characteristics, including read/write ratio, request size, and the
maximum number of outstanding requests. We use Iometer to generate
equivalent workloads to drive both the device driver and the
simulator. Table 1 lists some platform characteristics of the
prototype. Figure 5 shows the Iometer result. The throughput
discrepancy between the simulator and the prototype under all
queueing conditions is under 3%.

Operating system | Microsoft Windows 2000
CPU type         | Intel Pentium III 733 MHz
SCSI interface   | Adaptec 39160
SCSI bus speed   |
Disk model       | Seagate ST39133LWV 9.1 GB
RPM              | 10000
Average seek     | 5.2 ms read, 6.0 ms write

Table 1: Platform characteristics.

Figure 5: Comparison of throughput results from the prototype
system and the simulator. We use two random workloads, one with
just reads, and another with an equal number of reads and writes.
The request size is 512 bytes. The array configuration is a 2 x 3
SR-Array based on the RSATF scheduler. Writes are synchronously
propagated in the foreground. We vary the number of outstanding
requests (on the x-axis).

To shed more light on the accuracy of the model, in Table 2, we
give more detailed statistics of subjecting the model and the
prototype to the “Cello base” file system workload (described in
Section 4.1). The mean prediction error and low standard deviation
show that there are essentially only two types of requests: 99.8%
of the predictions are almost right on target, and 0.2% of the
predictions miss their targets by a very small amount of time and
incur a full rotation penalty. The net effect of these rare
rotation misses, however, is insignificant in terms of overall
access time. These results indicate that the simulator faithfully
simulates a real SR-Array, allowing us to understand the behavior
of the SR-Array using simulation-based results in later sections.

[Table 2 rows: Misses; Mean Prediction Error; Standard Deviation of
Error; Average Access Time; Demerit; Demerit/Access Time. Values
not recoverable.]

Table 2: Detailed statistics of model accuracy when subjected to
the “Cello base” file system workload. The configuration is a 2 x 3
SR-Array based on RSATF scheduling. I/O requests in this experiment
are physical I/O requests sent to drives, and access time is that
of a physical I/O. Prediction error is the difference between the
access time predicted by the scheduler and the actual measured
access time of a single request. We calculate demerit using the
definition by Ruemmler and Wilkes [21].


4 Experimental Results 


In this section, we evaluate the performance of the prototype
MimdRAID under two sets of experiments. The first set is based on
playing real-world file system and transaction processing traces on
the MimdRAID simulator. The second set is based on running a
synthetic workload generator on the prototype itself; the generator
is designed to stress the prototype in ways beyond what is possible
with the traces. The purposes of the experiments are to: 1)
validate the models of Section 2, 2) show the effectiveness and
importance of workload-driven configuration, and 3) demonstrate the
use of scaling the number of disks as a cost-effective means of
improving performance for certain workloads.


4.1 Macro-benchmarks 


We test our system using two traces. Cello is a 
two-month trace taken on a server running at HP
Labs [21]. It had eight disks and was used for run- 
ning simulations, compilations, editing, and reading 
mail and news. We use one week’s worth of trace 
data (for the period of 5/30/92 through 6/6/92). 
TPC-C is a disk trace (collected on 5/03/94) of an 
unaudited run of the Client/Server TPC-C bench- 
mark running at approximately 1150 tpmC on a 100 
Warehouse database. 


Logical Data Sets 


The 9.1 GB Seagate disks that we use are much larger and faster
than any of the original disks used in the trace; therefore, we do
not map the original small disks one-to-one onto our large disks.
Instead, we group the original data into three logical data sets
and study how to place each data set in a disk array made of new
disks.

[Table 3 rows: Data size; Duration; Avg. I/O rate; Async. writes;
Seek locality (L); Read after write (1 hour). Column values not
recoverable.]

Table 3: Trace characteristics. The “seek locality” row is
calculated as the ratio between the average of random seek
distances on that disk and the average seek distance observed in
the trace. This ratio is used to adjust the S parameter when
applying the models of Section 2 in subsequent discussions. The
“read after write (1 hour)” row lists the percentage of I/O
operations that are reads that occur less than one hour after
writing the same data.


The first data set consists of all the Cello disk 
data with the exception of disk 6, which houses 
“/usr/spool/news”. We merge these separate Cello 
disk traces based on time stamps to form a single 
large trace. The data from different disks are con- 
catenated to form a single data set. We refer to this 
workload as “Cello base” in the rest of the paper. 
The second data set consists solely of Cello disk 6. 
This disk houses the news directory; it exhibits ac- 
cess patterns that are different from the rest of the 
Cello disks and accounts for 47% of the total ac- 
cesses. We refer to this workload as “Cello disk 6” 
in the rest of the paper. The third data set consists 
of data from 31 original disks of the TPC-C trace. 
We merge these traces to form a single large trace; 
and we concatenate these disks to form a single data 
set as well. We refer to this workload as “TPC-C”. 


Table 3 lists the key characteristics of the trace 
data sets. Of particular interest is the last row, 
which reports the fraction of I/Os that are reads to 
recently written data. Although this ratio is high for 
TPC-C, it does not rise higher for intervals longer 
than an hour. Together with the amount of avail- 
able idle time, this ratio impacts the effectiveness 
of the delayed write propagation strategy and influ- 
ences the array configurations. 


To test our system with various load conditions, we also uniformly
scale the rate at which the trace is played based on the time stamp
information. For example, when the scaling rate is two, the traced
inter-arrival times are halved.
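
Uniform rate scaling amounts to compressing every time stamp toward the start of the trace. A minimal Python sketch, with a hypothetical function name and a plain list of arrival times as input:

```python
def scale_trace(timestamps, rate):
    """Return arrival times with every inter-arrival gap divided by `rate`."""
    if not timestamps:
        return []
    start = timestamps[0]
    # Dividing each offset from the start by `rate` divides every gap
    # between consecutive requests by the same factor.
    return [start + (t - start) / rate for t in timestamps]
```

For instance, arrivals at 10, 12, and 16 seconds played at a scaling rate of two arrive at 10, 11, and 13 seconds.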

Figure 6: Comparison of average I/O response time of the Cello file
system workloads on different disk array configurations. The
SR-Array uses the RSATF scheduler and the remaining configurations
use the SATF scheduler. The two configurations labeled “RAID-10”
and “$D_m$-way mirror” are reliable configurations and are denoted
by thicker curves. This convention is used throughout the rest of
the figures.


Playing Cello Traces at Original Speed 


As a starting point, we place the Cello base data set on one
Seagate disk and the Cello disk 6 data set on another. Although we
have fewer spindles in this case than in the original trace, the
original speed of the Cello traces is still sufficiently low that
we are effectively measuring individual I/O latency most of the
time. There is also sufficient idle time to mask the delayed write
propagations. Therefore, we apply the model in Section 2.3
(Equation (5)) to configure the SR-Array. When applying the
formulas, we account for the different degree of seek locality (L)
in Table 3 by replacing S with S/L. We perform replica propagation
in the background for all configurations. Although all write
operations from the trace are played, we exclude asynchronous
writes when reporting response time; most of the asynchronous
writes are generated by the file system sync daemon at 30-second
intervals. All reported response times include an overhead of 2.7
ms, which includes various processing times, transfer costs, track
switch time, and mechanical acceleration/deceleration times, as
defined in Section 2.3.


Figure 6 shows the performance improvement on the Cello workloads
as we scale the number of disks under various configurations. The
curve labeled as “SR-Array” shows the performance of the best
SR-Array configuration for a given number of disks. The SR-Array
performs well because it is able to effectively distribute disks to
the seek and rotational dimensions in a balanced manner. In
contrast, the performance of simple striping is poor due to the
lack of rotational delay reduction. This effect is more apparent
for larger numbers of disks due to the diminishing returns from
seek distance reduction. The performance of RAID-10 is intermediate
because the two replicas allow for a reduction in the rotational
delay to a limited extent. D-way mirroring is the closest
competitor to an SR-Array because of its high degree of flexibility
in choosing which replica to read. (We will expose the weakness of
D-way mirroring in subsequent experiments.) Note that our
SATF-based implementations of RAID-10 and D-way mirroring are
highly optimized versions based on rotational positioning
knowledge.

Figure 7: Configurations of the SR-Array for the two workloads of
Figure 6. The curves show the performance of the SR-Array
configuration recommended by the model of Equation (5). Each point
symbol in the graph shows the performance of an alternative
SR-Array configuration with a different number of rotational
replicas ($D_r$).

The figure also shows that the latency model of Section 2.3 is a
good approximation of the SR-Array performance. The anomalies on
the model curves are due to the following two practical
constraints: 1) $D_s$ and $D_r$ must be integer factors of the
total number of disks $D$, and 2) our implementation restricts the
largest degree of rotational replication to six. This restriction
is due to the difficulty of propagating more copies within a single
rotation, as rotational replicas are placed on different tracks and
a track switch costs about 900 µs. Due to these constraints, for
example, the largest practical value of $D_r$ for $D = 9$ is only
three, much smaller than the non-integer solution of Equation (5)
(5.8 for Cello base and 11.6 for Cello disk 6).
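
The two constraints make the set of usable configurations easy to enumerate. The short Python sketch below does so; the function name is hypothetical, and it lists only which $(D_s, D_r)$ pairs are feasible, without evaluating the latency model.

```python
MAX_ROTATIONAL_REPLICAS = 6  # implementation limit on rotational replication


def feasible_configs(total_disks, max_dr=MAX_ROTATIONAL_REPLICAS):
    """List (D_s, D_r) pairs with D_s * D_r == total_disks and D_r <= max_dr."""
    return [(total_disks // dr, dr)
            for dr in range(1, max_dr + 1)
            if total_disks % dr == 0]
```

For nine disks only 9 x 1 and 3 x 3 qualify, so the largest usable degree of rotational replication is three; for six disks the candidates are 6 x 1, 3 x 2, 2 x 3, and 1 x 6.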

While the Cello base data set consumes an entire Seagate disk, the
Cello disk 6 data set only occupies about 15% of the space on a
single Seagate disk; so the maximum seek delay of Cello disk 6 is
small to begin with for all configurations. Consequently, a larger
$D_r$ for an SR-Array is desirable as we increase the number of
disks. With these large $D_r$ values, however, the practical
constraints enumerated above start to take effect. Coupled with the
fact that seek time is no longer a linear function of seek distance
at such short seek distances, this explains the slightly more
pronounced anomalies of the SR-Array performance with a large
number of disks on the Cello disk 6 workload.

[Plots: response time (ms) versus number of disks. (a) TPC-C
performance. (b) SR-Array configurations.]

Figure 8: Average I/O response time of the TPC-C trace. (a)
Comparison of striping, RAID-10, and SR-Array. (b) Comparison of
alternative configurations of an SR-Array.

Figure 7 compares the performance of other possible SR-Array
configurations with that of the configuration chosen by the model.
For example, when the number of disks is six, the model recommends
a configuration of $D_s \times D_r = 2 \times 3$ for Cello base.
The three alternative configurations are 1 x 6, 3 x 2, and 6 x 1.
The figure shows that the model is largely successful at finding
good SR-Array configurations. For example, on Cello base, with six
disks, the SR-Array is 1.23 times as fast as a highly optimized
RAID-10, 1.42 times as fast as a striped system, and 1.94 times as
fast as the single disk base case.


Playing the TPC-C Trace at Original Speed 


Although a single new Seagate disk can accommo- 
date the entire TPC-C data set in terms of capacity, 
it cannot support the data rate of the original trace. 
Indeed, only a fraction of the space on the original 
traced disks was used to boost the data rate. We 
start with 12 disks for each of the array configura- 
tions. Figure 8 shows the performance as we scale 
the number of disks beyond the starting point. The 
data rate experienced by each disk under this work- 
load is much higher than that under the Cello sys- 
tem described in the last section. The workload also 
contains a large fraction of writes so it also stresses 
delayed write propagation as idle periods are shorter. 

Compared to Figure 6, two curves are missing from Figure 8. One is
D-way mirroring: it is impossible to support the original data rate
while attempting to propagate D replicas for each write. The other
missing curve is the latency model: the high data rate renders the
latency model inaccurate. The spirit of Figure 8, however, is very
similar to that of Figures 6 and 7: a properly configured SR-Array
is faster than a RAID-10, which is faster than a striped system.

Figure 9: Comparison of local disk schedulers for different
configurations as we raise the I/O rate. We use six disks for Cello
base (a), and 36 for TPC-C (b).

What is more interesting is the fact that strip- 
ing, the only configuration that does not involve 
replication, is not the best configuration even un- 
der the high update rate exhibited by this workload. 
There are at least two reasons. First, even under this 
higher I/O rate, there are still idle periods to mask 
replica propagations. Second, even without idle pe- 
riods, there exists a tradeoff between the benefits 
received from reading the closest replicas and the 
cost incurred when propagating replicas, as demon- 
strated by the models of Section 2.2; a configuration 
that can successfully exploit this tradeoff excels. For 
example, with 36 disks, a 9 x 4 x 1 SR-Array is 1.23
times as fast as an 18 x 1 x 2 RAID-10, and 1.39 times
as fast as a 36 x 1 x 1 striped system.


Playing Traces at Accelerated Rate 


Although the original I/O rate of TPC-C is higher 
than that of the Cello traces, it does not stress the 
12-disk arrays discussed in the last section. We now 
raise the I/O rates to stress the various configura- 
tions. For example, when the “scale rate” is two, we 
halve the inter-arrival time of requests. 

Before we compare the different array configurations, we first
consider the impact of the local disk schedulers. Figure 9
evaluates four schedulers: LOOK and SATF for striping, and RLOOK
and RSATF for an SR-Array. Given a particular request arrival rate,
the gap between RLOOK and RSATF is smaller than that between LOOK
and SATF. This is because both RLOOK and RSATF take rotational
positioning into consideration. Although it is a well-known result
that SATF outperforms LOOK, we see that SATF alone is not
sufficient for addressing rotational delays if the array is
misconfigured to begin with. For example, under the Cello base
workload, a 2 x 3 x 1 SR-Array significantly outperforms a
6 x 1 x 1 striped system even if the former only uses an RLOOK
scheduler while the latter uses an SATF scheduler. In the rest of
the discussions, unless specified otherwise, we use the RSATF
scheduler for SR-Arrays and the SATF scheduler for other
configurations.

Figure 10: Comparison of I/O response time on different
configurations as we raise the I/O rate. We use six disks for Cello
base (a), and 36 for TPC-C (b).

Figure 10 shows the performance of the various 
configurations while we fix the number of disks for 
each workload and vary the rate at which the trace 
is played. Under the Cello base workload (shown in 
Figure 10(a)), the 6-way mirror and the 1 x 6 SR- 
Array deliver the lowest sustainable rates. These 
configurations make the largest number of replicas 
and it is difficult to mask the replica propagation 
under high request rates. 6-way mirroring is better 
than the 1 x 6 SR-Array, because the 6-way mir- 
ror can afford the flexibility of choosing any disk to 
service a request, so it can perform better load bal- 
ancing. The 2 x 3 SR-Array is best for all the arrival 
rates that we have examined; this is because the ben- 
efits derived from the extra rotational replicas out- 
weigh the cost. If we demand an average response 
time no greater than 15 ms, the 2 x 3 x 1 SR-Array 
can support a request rate that is 1.3 times that of
a 3 x 1 x 2 RAID-10 and 2.6 times that of a 6 x 1 x 1
striped system.

The situation is different for the TPC-C workload (shown in Figure
10(b)). Under the original trace playing rate, the 9 x 4 x 1
SR-Array is best. As we raise the request arrival rate, we must
successively reduce the degree of replication; so the role of the
best configuration passes to the 12 x 3 x 1, 18 x 1 x 2,
18 x 2 x 1, and finally, 36 x 1 x 1 configurations, in that order.
If we again demand an average response time no greater than 15 ms,
the 36 x 1 x 1 configuration can support a request rate that is 1.3
times that of an 18 x 2 x 1 configuration and 2.1 times that of an
18 x 1 x 2 RAID-10 configuration.


Comparison Against Memory Caching 


We have seen that it is possible to achieve significant 
performance improvement by scaling the number of 
disks. We now compare this approach against one 
alternative: simply adding a bigger volatile memory
cache. The memory cache performs LRU replace- 
ment. Synchronous writes are forced to disks in both 
alternatives. In the following discussion, we assume 
that the price per MB ratio between memory and 
disk is M. At the time of this writing, 256 MB of 
memory costs $300, an 18 GB SCSI disk costs $400, 
and these prices give an M value of 57. 

Figure 11(a) examines the impact of memory caching on the Cello
base workload. At the trace scale rate of one, we need to cache an
additional 1.5%, or 126 MB, of the file system in memory to achieve
the same performance improvement as doubling the number of disks;
and we need to cache 4%, or 336 MB, of the file system to reach the
performance of a four-disk SR-Array. M needs to be less than 67 and
75, respectively, in order for memory caching to be cost effective,
which it is today.
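
The two thresholds are consistent with a simple break-even calculation, sketched below in Python. The function name is hypothetical, and the sketch rests on an assumption made for illustration: the data set fills one disk, so caching a fraction f of it costs M x f in units of one disk's price, and memory wins whenever that is less than the number of extra disks it replaces.

```python
def breakeven_m(extra_disks, cache_fraction):
    """Largest memory/disk price-per-MB ratio at which caching still wins.

    Assumes the data set occupies exactly one disk, so caching
    `cache_fraction` of it costs M * cache_fraction disk-equivalents,
    compared with `extra_disks` disk-equivalents for adding disks.
    """
    return extra_disks / cache_fraction


# Doubling one disk to two versus caching 1.5% of the file system.
doubling = breakeven_m(extra_disks=1, cache_fraction=0.015)
# Growing from one disk to four versus caching 4% of the file system.
four_disks = breakeven_m(extra_disks=3, cache_fraction=0.04)
```

Under this reading, the first case gives a threshold of about 66.7 and the second gives 75, matching the figures quoted above; the quoted price ratio of 57 falls below both.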

At the trace scale rate of three, using similar rea- 
soning, we can conclude that M needs to be less than 
20 in order for memory caching to be more cost ef- 
fective than doubling the number of disks. Beyond 
this budget, at this I/O rate, the diminishing local- 
ity and the need to flush writes to disks make the 
addition of memory less attractive. The addition of 
disks, however, speeds up all I/O operations, albeit 
at a diminishing rate. 

Figure 11(b) examines the impact of memory 
caching on the TPC-C workload, which has much 
less locality. We start with a 12-disk SR-Array. At 
a scale rate of one, M needs to be less than 80 in 
order for memory caching to be a cost effective al- 
ternative to increasing the number of disks to 18 or 
24. Adding memory is a more attractive alternative. 

Figure 11: Comparison of the effects of memory caching and scaling
the number of disks. The two SR-Array curves show the performance
improvement achieved by scaling the number of disks (bottom x-axis)
and they correspond to playing the traces at the original speed and
three times the original speed. The two Memory curves show the
performance improvement achieved by scaling the amount of memory
caching (top x-axis).


At a scale rate of three, M needs to be less than 
24 for memory caching to be more cost effective than 
increasing the number of disks to 18. Beyond this 
budget, adding memory provides little additional
performance gain, while increasing the number of 
disks from 18 to 36 can provide an additional 1.76 
times speedup. 


4.2 Micro-benchmarks


To further explore the behavior of the prototype, 
we use the Intel Iometer benchmark to stress some 
array configurations in a controlled manner. In all 
the following micro-benchmarks, we use a seek local- 
ity index of 3, as defined in Section 2.3. We measure 
the throughput in these experiments. 


Throughput Models 


In this experiment, we perform only random read op- 
erations on the disk array while maintaining a con- 
stant number of outstanding requests. (We examine 
writes more fully in the next subsection.) The goals 
are 1) to understand the scalability of the disk array, 
2) to understand the behavior of the system under 
different load conditions, and 3) to validate (part of) 
the throughput model of Section 2.4. 

Figure 12 shows that the SR-Array using the RSATF scheduler scales
well as we increase the number of disks under this Iometer
workload. The RLOOK scheduler is a close approximation of the RSATF
scheduler; and the RLOOK-based throughput model closely captures
the behavior of the SR-Array, including the throughput degradation
experienced when the queue length is short.

Figure 12: Throughput as a function of array configuration, number
of disks, and queue length. The queue lengths are (a) 8, and (b)
32.

The SATF-based striped and RAID-10 systems 
do not scale as well as the SR-Array. The through- 
put gap between all these systems, however, nar- 
rows as the queue length increases since the SATF 
scheduler can overcome the lack of rotational repli- 
cas when it has a large number of requests to choose 
from. 


Replica Propagation Cost 


We now analyze configurations under foreground 
write propagation and validate the model in Sec- 
tion 2.4. 

Figure 13 shows the throughput results as Iometer maintains a
constant queue length of mixed reads and writes. Each write leads
to immediate replica propagations; so write ratio and foreground
write ratio are the same, namely, $1 - p$, where $p$ is defined by
Equation (8) of Section 2.3.

Among the configurations shown in the figure, RAID-10 has the worst
performance under high write ratios. To understand why, consider
the propagation of a single write: the 3 x 2 x 1 SR-Array requires
a single seek followed by writing 2 rotational replicas in a single
cylinder; but a corresponding 3 x 1 x 2 RAID-10 requires 2 seeks,
so the amount of arm movement tends to be greater.

The performance of the striped 6 x 1 x 1 configuration degrades
slightly for high write ratios as writes are slightly more
expensive than reads.

The performance difference between a 3 x 2 x 1 SR-Array and a
6 x 1 x 1 striped system depends on the write ratio, with the
former better for low write ratios. If we only consider rotational
delay, the rotational replication model of Section 2.2 would imply
that the cross-over point between them under LOOK/RLOOK scheduling
should be close to the 50% write ratio. If we also consider seek
distance, the 3 x 2 x 1 SR-Array has worse seek performance, so the
actual cross-over point is less than 50%.

Figure 13: Throughput as a function of foreground write propagation
rate and queue length. The total number of disks is six. The queue
lengths are (a) 8, and (b) 32.

Because SATF benefits a 6 x 1 x 1 configuration more than RSATF
benefits a 3 x 2 x 1 configuration, the cross-over point between
these systems under SATF/RSATF scheduling is to the left of that
under LOOK/RLOOK scheduling. The distance between the two
cross-over points is even greater when the queue is longer (in
Figure 13(b)).

The figure also shows that the RLOOK throughput model (Equation
(16)) closely tracks the experimental result under varying write
ratios.


5 Related Work 


A number of previous storage systems were de- 
signed to take into consideration the tradeoff be- 
tween capacity and performance. Hou and Patt [10] 
performed a simulation study of the tradeoff between 
mirroring and RAID-5. 

The HP AutoRAID incorporated both mirroring 
and RAID-5 into a two-level hierarchy [28]. The 
mirrored upper level provided faster small writes at 
the expense of consuming more storage, while the 
RAID-5 lower level was more frugal in its use of disk 
space. Its primary focus was solving the small write 
problem of RAID-5. 

We have taken the tradeoff between capacity and 
performance a step further by 1) improving latency 
and throughput of all I/O operations, 2) being able 
to benefit from more than twice the excess capacity, 
and 3) providing a means of systematically config- 
uring the extra disk heads. 

The HP Ivy project [15] was a simulation study of 
how a high degree of replication could improve read 
performance. Our study differs from Ivy in several
ways. First, Ivy only explored reducing seek dis- 
tance and left rotational delay unresolved. Second, 
Ivy only examined mirroring. The third difference 
is a feature of Ivy that we intend to incorporate into 
our system in the future: Ivy dynamically chose the 
candidate and the degree of replication by observing 
access patterns. We are currently researching a wide 
range of access patterns (including those at the file 
system level) that can be used to dynamically tune 
the array configuration. 

Matloff [17] derived a model of linear improve- 
ment of seek distance as one increased the number of 
disks devoted to striping. Bitton and Gray derived a 
model of seek distance reduction [3] and studied seek 
scheduling [2] for a D-way mirror. Neither study 
considered the impact of rotational delay. 

Dishon and Liu [6] considered latency reduc- 
tion on either synchronized or unsynchronized D- 
way mirrors. A synchronized mirror can reduce 
foreground propagation latency because the multi- 
ple copies can be written at nearly the same time if 
we insist that the replicas are placed at rotationally 
identical positions. This advantage comes at the cost 
of poor read latency because it allows no rotational 
delay reduction for reads. 

Polyzois [20] proposed careful scheduling of de- 
layed writes to different disks in a mirror to max- 
imize throughput, a technique that can potentially 
benefit delayed writes in our systems when the repli- 
cas are on different disks. 

The “distorted mirror” [19] provided an alterna- 
tive way of improving the performance of writes in 
a mirror. It performed writes initially to rotation- 
ally optimal but variable locations and propagated 
them to fixed locations later. This technique can be 
integrated with our delayed write strategy as well. 

Lumb et al. [16] exploited “free bandwidth” that 
is available when the disk head is in between ser- 
vicing normal requests in a busy system. The free 
bandwidth was used for background I/O activity. 
Propagating replicas in our system is a good use of 
this free bandwidth. 

Ng examined intra-track replication as a means 
of reducing rotational delay [18]. We extend this ap- 
proach to improve large I/O bandwidth by perform- 
ing rotational replication across different tracks. 

The importance of reducing rotational delay has 
long been recognized. Seltzer and Jacobson independently 
examined a number of disk scheduling algorithms that take 
rotational position into consideration [14, 23]. Our work 
considers the impact of 
reducing rotational delay in array configurations in 
a manner that balances the conflicting goals of reducing 
seek and rotational delay at the same time. 

At the time of this writing, the Trail system [12] 
independently developed a disk head tracking mech- 
anism that is similar to ours. Trail used this infor- 
mation to perform fast log writes to carefully chosen 
rotational positions. A similar write strategy was 
in use in the earlier Mime system [5], but Mime re- 
lied on hardware support for its rotational position- 
ing information. Aboutabl et al. developed a sim- 
ilar disk timing measurement strategy, which was 
used to model the response time of individual I/O 
requests [1]. 

A number of drive manufacturers have incor- 
porated SATF-like scheduling algorithms in their 
firmware. An early example was the HP C2490A [9]. 
Our host-based software solution enables the em- 
ployment of such scheduling on drives that do not 
support it internally. Furthermore, it allows experi- 
mentation with strategies such as rotational replica 
selection, strategies that would have been difficult 
to realize even on drives that support intelligent 
scheduling internally. On the other hand, if the drive 
does support intelligent internal scheduling, an in- 
teresting question that this study has not addressed 
is how we can adapt our algorithm for such drives 
without relying on complex predictions. 

One of our goals in studying the impact 
of altering array configurations is to understand 
how to configure a storage system given certain 
cost/performance specifications. The “attribute- 
managed storage” project [7] at HP shares this goal, 
although its focus is at the disk array level as op- 
posed to individual drive level. 


6 Conclusion 


In this paper, we have described a way of de- 
signing disk arrays that can flexibly reduce seek and 
rotational delay in a balanced manner. We have 
presented a series of analytical models that take 
into consideration disk and workload characteris- 
tics. By incorporating these models and a robust 
software-based disk head position prediction mecha- 
nism, the MimdRAID prototype can deliver latency 
and throughput results unmatched by conventional 
approaches. 

Abstract. This paper explores interposed request 
routing in Slice, a new storage system architecture 
for high-speed networks incorporating network-attached 
block storage. Slice interposes a request switching 
filter, called a µproxy, along each client’s network 
path to the storage service (e.g., in a network adapter 
or switch). The µproxy intercepts request traffic and 
distributes it across a server ensemble. We propose 
request routing schemes for I/O and file service 
traffic, and explore their effect on service structure. 


The Slice prototype uses a packet filter proxy 
to virtualize the standard Network File System 
(NFS) protocol, presenting to NFS clients a uni- 
fied shared file volume with scalable bandwidth and 
capacity. Experimental results from the industry- 
standard SPECsfs97 workload demonstrate that 
the architecture enables construction of powerful 
network-attached storage services by aggregating 
cost-effective components on a switched Gigabit 
Ethernet LAN. 


1 Introduction 


Demand for large-scale storage services is growing 
rapidly. A prominent factor driving this growth is 
the concentration of storage in data centers hosting 
Web-based applications that serve large client pop- 
ulations through the Internet. At the same time, 
storage demands are increasing for scalable comput- 
ing, multimedia and visualization. 


A successful storage system architecture must scale 
to meet these rapidly growing demands, placing 
a premium on the costs (including human costs) 
to administer and upgrade the system. Commercial 
systems increasingly interconnect storage devices 
and servers with dedicated Storage Area Networks 
(SANs), e.g., FibreChannel, to enable incremental 
scaling of bandwidth and capacity by attaching 
more storage to the network. Recent advances 
in LAN performance have narrowed the bandwidth 
gap between SANs and LANs, creating an oppor- 
tunity to take a similar approach using a general- 
purpose LAN as the storage backplane. A key chal- 
lenge is to devise a distributed software layer to 
unify the decentralized storage resources. 


This paper explores interposed request routing in 
Slice, a new architecture for network storage. Slice 
interposes a request switching filter, called a 
µproxy, along each client’s network path to the 
storage service. The µproxy may reside in a 
programmable switch or network adapter, or in a 
self-contained module at the client’s or server’s 
interface to the network. We show how a simple µproxy 
can virtualize a standard network-attached storage 
protocol incorporating file services as well as raw 
device access. The Slice proxy distributes request 
traffic across a collection of storage and server 
elements that cooperate to present a uniform view of a 
shared file volume with scalable bandwidth and capacity. 


This paper makes the following contributions: 


• It outlines the architecture and its implementation 
in the Slice prototype, which is based on a µproxy 
implemented as an IP packet filter. We explore the 
impact on service structure, reconfiguration, and 
recovery. 


• It proposes and evaluates request routing policies 
within the architecture. In particular, we introduce 
two policies for transparent scaling of the name space 
of a unified file volume. These techniques complement 
simple grouping and striping policies to distribute 
file access load. 


• It evaluates the prototype using synthetic 
benchmarks including SPECsfs97, an industry-standard 
workload for network-attached storage servers. The 
results demonstrate that the system is scalable and 
that it complies with the Network File System (NFS) V3 
standard, a popular protocol for network-attached 
storage. 


Figure 1: Combining functional decomposition and 
data decomposition in the Slice architecture. 


This paper is organized as follows. Section 2 outlines 
the architecture and sets Slice in context with 
related work. Section 3 discusses the role of the 
µproxy, defines the request routing policies, and 
discusses service structure. Section 4 describes the 
Slice prototype, and Section 5 presents experimental 
results. Section 6 concludes. 


2 Overview 


The Slice file service consists of a collection of 
servers cooperating to serve an arbitrarily large 
virtual volume of files and directories. To a client, 
the ensemble appears as a single file server at some 
virtual network address. The µproxy intercepts and 
transforms packets to redirect requests and to 
represent the ensemble as a unified file service. 


Figure 1 depicts the structure of a Slice ensemble. 
Each client’s request stream is partitioned into three 
functional request classes corresponding to the major 
file workload components: (1) high-volume I/O to large 
files, (2) I/O on small files, and (3) operations on 
the name space or file attributes. The µproxy switches 
on the request type and arguments to redirect requests 
to a selected server responsible for handling a given 
class of requests. Bulk I/O operations route directly 
to an array of storage nodes, which provide block-level 
access to raw storage objects. Other operations are 
distributed among specialized file managers responsible 
for small-file I/O and/or name space requests. 


This functional decomposition diverts high-volume 
data flow to bypass the managers, while allowing 
specialization of the servers for each workload com- 
ponent, e.g., by tailoring the policies for disk layout, 
caching and recovery. A single server node could 
combine the functions of multiple server classes; we 
separate them to highlight the opportunities to dis- 
tribute requests across more servers. 


The µproxy selects a target server by switching on 
the request type and the identity of the target file, 
name entry, or block, using a separate routing func- 
tion for each request class. Thus the routing func- 
tions induce a data decomposition of the volume 
data across the ensemble, with the side effect of cre- 
ating or caching data items on the selected man- 
agers. Ideally, the request routing scheme spreads 
the data and request workload in a balanced fashion 
across all servers. The routing functions may adapt 
to system conditions, e.g., to use new server sites 
as they become available. This allows each work- 
load component to scale independently by adding 
resources to its server class. 


2.1 The µproxy 


An overarching goal is to keep the µproxy simple, 
small, and fast. The µproxy may (1) rewrite the 
source address, destination address, or other fields of 
request or response packets, (2) maintain a bounded 
amount of soft state, and (3) initiate or absorb 
packets to or from the Slice ensemble. The µproxy does 
not require any state that is shared across clients, 
so it may reside on the client host or network 
interface, or in a network element close to the server 
ensemble. The proxy is not a barrier to scalability 
because its functions are freely replicable, with the 
constraint that each client’s request stream passes 
through a single µproxy. 


The µproxy functions as a network element within 
the Internet architecture. It is free to discard its 
state and/or pending packets without compromising 
correctness. End-to-end protocols (in this case 
NFS/RPC/UDP or TCP) retransmit packets as 
necessary to recover from drops in the proxy. 
Although the µproxy resides “within the network”, it 
acts as an extension of the service. For example, 
since the proxy is a layer-5 protocol component, it 
must reside (logically) at one end of the connection 
or the other; it cannot reside in the “middle” of the 
connection where end-to-end encryption might hide 
layer-5 protocol fields. 


2.2 Network Storage Nodes 


A shared array of network storage nodes provides all 
disk storage used in a Slice ensemble. The µproxy 
routes bulk I/O requests directly to the network 
storage array, without intervention by a file manager. 
More storage nodes may be added to incrementally 
scale bandwidth, capacity, and disk arms. 


The Slice block storage prototype is loosely based 
on a proposal in the National Storage Industry 
Consortium (NSIC) for object-based storage devices 
(OBSD) [3]. Key elements of the OBSD proposal 
were in turn inspired by the CMU research on Net- 
work Attached Secure Disks (NASD) [8, 9]. Slice 
storage nodes are “object-based” rather than sector- 
based, meaning that requesters address data as log- 
ical offsets within storage objects. A storage object 
is an ordered sequence of bytes with a unique iden- 
tifier. The placement policies of the file service are 
responsible for distributing data among storage ob- 
jects so as to benefit fully from all of the resources 
in the network storage array. 


A key advantage of OBSDs and NASDs is that they 
allow for cryptographic protection of storage object 
identifiers if the network is insecure [9]. This 
protection allows the µproxy to reside outside of the 
server ensemble’s trust boundary. In this case, the 
damage from a compromised µproxy is limited to the 
files and directories that its client(s) had 
permission to access. However, the Slice request routing 
architecture is compatible with conventional 
sector-based storage devices if every proxy resides 
inside the service trust boundary. 


This storage architecture is orthogonal to the ques- 
tion of which level arranges redundancy to tolerate 
disk failures. One alternative is to provide redun- 
dancy of disks and other vulnerable components in- 
ternally to each storage node. A second option is for 
the file service software to mirror data or maintain 
parity across the storage nodes. In Slice, the choice 
to employ extra redundancy across storage nodes 
may be made on a per-file basis through support 
for mirrored striping in our prototype’s I/O routing 
policies. For stronger protection, a Slice configura- 
tion could employ redundancy at both levels. 


The Slice block service includes a coordinator mod- 
ule for files that span multiple storage nodes. The 
coordinator manages optional block maps (Section 3.1) 
and preserves atomicity of multisite operations 
(Section 3.3.2). A Slice configuration may 
include any number of coordinators, each managing 
a subset of the files (Section 4.2). 


2.3 File Managers 


File management functions above the network stor- 
age array are split across two classes of file man- 
agers. Each class governs functions that are com- 
mon to any file server; the architecture separates 
them to distribute the request load and allow im- 
plementations specialized for each request class. 


• Directory servers handle name space operations, 
e.g., to create, remove, or look up files and 
directories by symbolic name; they manage directories 
and mappings from names to identifiers and attributes 
for each file or directory. 


• Small-file servers handle read and write operations 
on small files and the initial segments of 
large files (Section 3.1). 


Slice file managers are dataless; all of their state is 
backed by the network storage array. Their role is to 
aggregate their structures into larger storage objects 
backed by the storage nodes, and to provide memory 
and CPU resources to cache and manipulate those 
structures. In this way, the file managers can benefit 
from the parallel disk arms and high bandwidth of 
the storage array as more storage nodes are added. 


The principle of dataless file managers also plays a 
key role in recovery. In addition to its backing ob- 
jects, each manager journals its updates in a write- 
ahead log [10]; the system can recover the state of 
any manager from its backing objects together with 
its log. This allows fast failover, in which a surviving 
site assumes the role of a failed server, recovering its 
state from shared storage [12, 4, 24]. 


2.4 Summary 


Interposed request routing in the Slice architecture 
yields three fundamental benefits: 


• Scalable file management with content-based request 
switching. Slice distributes file service requests 
across a server ensemble. A good request switching 
scheme induces a balanced distribution of file objects 
and requests across servers, and improves locality in 
the request stream. 


• Direct storage access for high-volume I/O. The 
µproxy routes bulk I/O traffic directly to the 
network storage array, removing the file managers 
from the critical path. Separating requests in this 
fashion eliminates a key scaling barrier for 
conventional file services [8, 9]. At the same time, 
the small-file servers absorb and aggregate I/O 
operations on small files, so there is no need for the 
storage nodes to handle small objects efficiently. 


• Compatibility with standard file system clients. 
The µproxy factors request routing policies out 
of the client-side file system code. This allows 
the architecture to leverage a minimal computing 
capability within the network elements to 
virtualize the storage protocol. 


2.5 Related Work 


A large number of systems have interposed new sys- 
tem functionality by “wrapping” an existing inter- 
face, including kernel system calls [14], internal in- 
terfaces [13], communication bindings [11], or mes- 
saging endpoints. The concept of a proxy mediating 
between clients and servers [23] is now common in 
distributed systems. We propose to mediate some 
storage functions by interposing on standard storage 
access protocols within the network elements. Net- 
work file services can benefit from this technique be- 
cause they have well-defined protocols and a large 
installed base of clients and applications, many of 
which face significant scaling challenges today. 


The Slice proxy routes file service requests based 
on their content. This is analogous to the HTTP 
content switching features offered by some net- 
work switch vendors (e.g., Alteon, Arrowpoint, F5), 
based in part on research demonstrating improved 
locality and load balancing for large Internet server 
sites [20]. Slice extends the content switching con- 
cept to a file system context. 


A number of recent commercial and research ef- 
forts investigate techniques for building scalable 
storage systems for high-speed switched LAN networks. 
These systems are built from disks distributed 
through the network, and attached to dedicated 
servers [16, 24, 12], cooperating peers [4, 26], 
or the network itself [8, 9]. We separate these sys- 
tems into two broad groups. 


The first group separates file managers (e.g., the 
name service) from the block storage service, as in 
Slice. This separation was first proposed for the 
Cambridge Universal File Server [6]. Subsequent 
systems adopted this separation to allow bulk I/O 
to bypass file managers [7, 12], and it is now a basic 
tenet of research in network-attached storage de- 
vices including the CMU NASD work on devices for 
secure storage objects [8, 9]. Slice shows how to 
incorporate placement and routing functions essential 
for this separation into a new filesystem structure 
for network-attached storage. The CMU NASD 
project integrated similar functions into network 
file system clients [9]; the Slice model decouples 
these functions, preserving compatibility with ex- 
isting clients. In addition, Slice extends the NASD 
project approach to support scalable file manage- 
ment as well as high-bandwidth I/O for large files. 


A second group of scalable storage systems lay- 
ers the file system functions above a network stor- 
age volume using a shared disk model. Policies 
for striping, redundancy, and storage site selection 
are specified on a volume basis; cluster nodes coor- 
dinate their accesses to the shared storage blocks 
using an ownership protocol. This approach has 
been used with both log-structured (Zebra [12] and 
xFS [4]) and conventional (Frangipani/Petal [16, 24] 
and GFS [21]) file system organizations. The clus- 
ter may be viewed as “serverless” if all nodes are 
trusted and have direct access to the shared disk, 
or alternatively the entire cluster may act as a file 
server to untrusted clients using a standard network 
file protocol, with all I/O passing through the clus- 
ter nodes as they mediate access to the disks. 


The key benefits of Slice request routing apply 
equally to these shared disk systems when untrusted 
clients are present. First, request routing is a key to 
incorporating secure network-attached block storage, 
which allows untrusted clients to address storage 
objects directly without compromising the integrity 
of the file system. That is, a µproxy could 
route bulk I/O requests directly to the devices, 
yielding a more scalable system that preserves com- 
patibility with standard clients and allows per-file 
policies for block placement, parity or replication, 
prefetching, etc. Second, request routing enhances 
locality in the request stream to the file servers, im- 
proving cache effectiveness and reducing block con- 
tention among the servers. 


The shared disk model is used in many commercial 
systems, which increasingly interconnect storage de- 
vices and servers with dedicated Storage Area Net- 
works (SANs), e.g., FibreChannel. This paper ex- 
plores storage request routing for Internet networks, 
but the concepts are equally applicable in SANs. 


Our proposal to separate small-file I/O from the 
request stream is similar in concept to the Amoeba 
Bullet Server [25], a specialized file server that op- 
timizes small files. As described in Section 4.4, 
the prototype small-file server draws on techniques 
from the Bullet Server, FFS fragments [19], and 
SquidMLA [18], a Web proxy server that maintains 
a user-level “filesystem” of small cached Web pages. 


3 Request Routing Policies 


This section explains the structure of the µproxy 
and the request routing schemes used in the Slice 
prototype. The purpose is to illustrate concretely 
the request routing policies enabled by the architec- 
ture, and the implications of those policies for the 
way the servers interact to maintain and recover 
consistent file system states. We use the NFS V3 
protocol as a reference point because it is widely 
understood and our prototype supports it. 


The µproxy intercepts NFS requests addressed to 
virtual NFS servers, and routes each request to a 
physical server by applying a function to the re- 
quest type and arguments. It then rewrites the IP 
address and port to redirect the request to the se- 
lected server. When a response arrives, the proxy 
rewrites the source address and port before forward- 
ing it to the client, so the response appears to orig- 
inate from the virtual NFS server. 
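
The two rewriting steps can be illustrated with a minimal sketch. The Packet type, the addresses, and the function names below are hypothetical; they stand in for real IP/UDP header manipulation and are not taken from the Slice prototype.

```python
from dataclasses import dataclass, replace

# Hypothetical virtual NFS server address that clients mount.
VIRTUAL_SERVER = ("10.1.1.100", 2049)


@dataclass(frozen=True)
class Packet:
    src: tuple      # (IP address, port) of the sender
    dst: tuple      # (IP address, port) of the receiver
    payload: bytes  # NFS/RPC message, left untouched in this sketch


def redirect_request(pkt, physical_server):
    """Rewrite the destination so the request reaches the selected server."""
    return replace(pkt, dst=physical_server)


def restore_response(pkt):
    """Rewrite the source so the reply appears to come from the virtual server."""
    return replace(pkt, src=VIRTUAL_SERVER)
```

Because only header fields change, the client needs no knowledge of which physical server handled the request.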


The request routing functions must permit 
reconfiguration to add or remove servers, while 
minimizing state requirements in the proxy. The µproxy 
directs most requests by extracting relevant fields 
from the request, perhaps hashing to combine mul- 
tiple fields, and interpreting the result as a logical 
server site ID for the request. It then looks up the 
corresponding physical server in a compact routing 
table. Multiple logical sites may map to the same 
physical server, leaving flexibility for reconfiguration 
(Section 3.3.1). The routing tables constitute soft 
state; the mapping is determined externally, so the 
µproxy never modifies the tables. 
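
The lookup path described above (extract fields, hash, interpret the result as a logical site ID, index a compact table) might look like the following sketch. The table contents, the CRC32 hash, and the function names are assumptions chosen for illustration; the paper does not specify them here.

```python
import zlib

# Hypothetical routing table: index = logical site ID, value = physical server.
# Several logical sites map to the same physical server, which leaves room
# to move one logical site later without rehashing any requests.
ROUTING_TABLE = ["10.1.2.1", "10.1.2.1", "10.1.2.2", "10.1.2.2"]


def logical_site(fields, num_sites):
    """Combine the selected request fields and hash them to a logical site ID."""
    key = b"|".join(fields)
    return zlib.crc32(key) % num_sites


def route(fields, table=ROUTING_TABLE):
    """Return the physical server for a request. The table is only read here;
    in the text, the mapping is determined externally."""
    return table[logical_site(fields, len(table))]
```

Reconfiguration in this sketch amounts to installing a new table in which one logical site points at a different physical server; requests that hash to other sites are unaffected.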


The µproxy examines up to four fields of each 
request, depending on the policies configured: 


• Request type. Routing policies are keyed by the 
NFS request type, so the µproxy may employ 
different policies for different functions. Table 1 
lists the important NFS request groupings discussed 
in this paper. 


• File handle. Each NFS request targets a specific 
file or directory, named by a unique identifier 
called a file handle (or fhandle). Although 
NFS fhandles are opaque to the client, their 
structure can be known to the proxy, which 
acts as an extension of the service. Directory 
servers encode a fileID in each fhandle, which 
the proxies extract as a routing key. 


• Read/write offset. NFS I/O operations specify 
the range of offsets covered by each read and 
write. The proxy uses these fields to select 
the server or storage node for the data. 


• Name component. NFS name space requests 
include a symbolic name component in their arguments 
(see Table 1). A key challenge for scaling file 
management is to obtain a balanced distribution of 
these requests. This is particularly important for 
name-intensive workloads with small files and heavy 
create/lookup/remove activity, as often occurs in 
Internet services for mail, news, message boards, and 
Web access. 


We now outline some µproxy policies that use these 
fields to route specific request groups. 


3.1 Block I/O 


Request routing for read/write requests has two 
goals: separate small-file read/write traffic from 
bulk I/O, and decluster the blocks of large files 
across the storage nodes for the desired access prop- 
erties (e.g., high bandwidth or a specified level of 
redundancy). We address each in turn. 


When small-file servers are configured, the proto- 
type’s routing policy defines a fixed threshold offset 
(e.g., 64KB); the proxy directs I/O requests be- 
low the threshold to a small-file server selected from 
the request fhandle. The threshold offset is neces- 
sary because the size of each file may change at any 
time. Thus the small-file servers also receive a sub- 
set of the I/O requests on large files; they receive 
all I/O below the threshold, even if the target file 
is large. In practice, large files have little impact 
on the small-file servers because there tends to be 
a small number of these files, even if they make up 
a large share of the stored bytes. Similarly, large 
file I/O below the threshold is limited by the band- 
width of the small-file server, but this affects only 
the first threshold bytes, and becomes progressively 
less significant as the file grows. 
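
A minimal sketch of the threshold test follows. The 64KB value is the example given in the text; selecting the small-file server as fileID modulo the number of servers, and routing on the starting offset alone, are simplifying assumptions made for this illustration.

```python
THRESHOLD = 64 * 1024  # example threshold offset from the text (64KB)


def route_io(offset, file_id, small_file_servers, threshold=THRESHOLD):
    """Classify a read/write by its starting offset.

    Returns ("small-file", server) for I/O below the threshold, regardless
    of how large the file currently is, and ("storage-array", None) otherwise.
    """
    if offset < threshold:
        # Assumed selection rule: fileID (from the fhandle) modulo server count.
        site = file_id % len(small_file_servers)
        return ("small-file", small_file_servers[site])
    return ("storage-array", None)
```

The decision depends only on fields present in the request, so the proxy never needs to know the current file size.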


The µproxy redirects I/O traffic above the threshold 
directly to the network storage array, using some 
placement policy to select the storage site(s) for each 
block. A simple option is to employ static striping 
and placement functions that compute on the 
block offset and/or fileID. More flexible placement 
policies would allow the µproxy to consider other 
factors, e.g., load conditions on the network or 
storage nodes, or file attributes encoded in the fhandle. 
To generalize to more flexible placement policies, 
Slice optionally records block locations in per-file 
block maps managed by the block service coordinators. 
The µproxies interact with the coordinators 
to fetch and cache fragments of the block maps as 
they handle I/O operations on files. 


Name Space Operations 

lookup(dir, name) returns (fhandle, attr) 
    Look up a name in dir, return handle and attributes. 

create(dir, name) returns (fhandle, attr) 
mkdir(dir, name) returns (fhandle, attr) 
    Create a file/directory and update the parent entry/link 

remove(dir, name), rmdir(dir, name) 
    Remove a file/directory or hard link and update the parent 

link(olddir, oldname, newdir, newname) returns (fhandle, attr) 
    Create a new name for a file, update the file link count, 

rename(olddir, oldname, newdir, newname) returns (fhandle, attr) 
    Rename an existing file or hard link; update the link count 

Attribute Operations 

getattr(object) returns (attr) 
    Retrieve the attributes of a file or directory. 

setattr(object, attr) 
    Modify the attributes of a file or directory, and update its 
    modify timestamp. 

I/O Operations 

read(file, offset, len) returns (data, attr) 
    Read data from a file, updating its access timestamp. 

write(file, offset, len) returns (data, attr) 
    Write data to a file, updating its modify timestamp. 

Directory Retrieval 

readdir(dir, cookie) returns (entries, cookie) 
    Read some or all of the entries in a directory. 

Table 1: Some important Network File System (NFS) protocol operations. 
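
The static striping option mentioned above can be sketched as a pure function of the block offset and fileID. The 32KB stripe unit and the exact formula are assumptions for illustration only; the text says only that such functions compute on the block offset and/or fileID.

```python
BLOCK_SIZE = 32 * 1024  # hypothetical stripe unit in bytes


def stripe_site(file_id, offset, num_nodes, block_size=BLOCK_SIZE):
    """Map a byte offset within a file to a storage node index.

    Consecutive blocks go to consecutive nodes; adding file_id staggers the
    starting node so that different files do not all begin on node 0.
    """
    block = offset // block_size
    return (file_id + block) % num_nodes
```

A function like this needs no per-file state in the proxy, which is why block maps are only required for the more flexible placement policies.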


As one example of an attribute-based policy, Slice 
supports a mirrored striping policy that replicates 
each block of a mirrored file on multiple storage 
nodes, to tolerate failures up to the replication de- 
gree. Mirroring consumes more storage and net- 
work bandwidth than striping with parity, but it is 
simple and reliable, avoids the overhead of comput- 
ing and updating parity, and allows load-balanced 
reads [5, 16]. 
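
As a rough illustration of mirrored striping, the sketch below places each block on several consecutive nodes and serves a read from the least-loaded replica. The placement formula, the `loads` list of per-node queue lengths, and the least-loaded rule are all hypothetical; the text states only that blocks are replicated on multiple nodes and that reads can be load-balanced.

```python
def mirror_sites(file_id, block, num_nodes, replicas=2):
    """Return the distinct storage nodes holding copies of one block."""
    if not 1 <= replicas <= num_nodes:
        raise ValueError("replication degree must be between 1 and num_nodes")
    first = (file_id + block) % num_nodes
    return [(first + i) % num_nodes for i in range(replicas)]


def read_site(file_id, block, num_nodes, replicas, loads):
    """Choose one replica for a read: here, the node with the smallest load."""
    return min(mirror_sites(file_id, block, num_nodes, replicas),
               key=lambda node: loads[node])
```

A write in this sketch would be sent to every node returned by `mirror_sites`, which is where the extra storage and network bandwidth go.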


3.2 Name Space Operations 


Effectively distributing name space requests 
presents different challenges from I/O request routing. 
Name operations involve more computation, 
and name entries may benefit more from caching 
because they tend to be relatively small and 
fragmented. Moreover, directories are frequently 
shared. Directory servers act as synchronization 
points to preserve integrity of the name space, e.g., 
to prevent clients from concurrently creating a file 
with the same name, or removing a directory while 
a name create is in progress. 


A simple approach to scaling a file service is to parti- 
tion the name space into a set of volumes, each man- 
aged by a single server. Unfortunately, this VOLUME 
PARTITIONING strategy compromises transparency 
and increases administrative overhead in two ways. 
First, volume boundaries are visible to clients as 
mount points, and naming operations such as link 
and rename cannot cross volume boundaries. Second, 
the system develops imbalances if volume loads 
grow at different rates, requiring intervention to 
repartition the name space. This may be visible to 
users through name changes to existing directories. 


An important goal of name management in Slice 
is to automatically distribute the load of a single 
file volume across multiple servers, without impos- 
ing user-visible volume boundaries. We propose two 
alternative name space routing policies to achieve 
this goal. MKDIR SWITCHING yields balanced dis- 
tributions when the average number of active di- 
rectories is large relative to the number of direc- 
tory server sites, but it binds large directories to a 
single server. For workloads with very large direc- 
tories, NAME HASHING yields probabilistically bal- 
anced request distributions independent of work- 
load. The cost of this effectiveness is that more 
operations cross server boundaries, increasing the 
cost and complexity of coordination among the di- 
rectory servers (Section 4.3). 


MKDIR SWITCHING works as follows. In most cases,
the µproxy routes name space operations to the
directory server that manages the parent directory;
the µproxy identifies this server by indexing its
routing table with the fileID from the parent directory
fhandle in the request (refer to Table 1). On a mkdir
request, the proxy decides with probability p to
redirect the request to a different directory server,
placing the new directory (and its descendants)
on a different site from the parent directory. The
policy uniquely selects the new server by hashing
on the parent fhandle and the symbolic name of the
new directory; this guarantees that races over name
manipulation involve at most two sites. Reducing
directory affinity by increasing p makes the policy
more aggressive in distributing name entries across
sites; this produces a more balanced load, but more
operations involve multiple sites. Section 5 presents
experimental data illustrating this tradeoff.
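
The MKDIR SWITCHING decision can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the prototype's kernel code: the function names, the byte-string encoding of fhandles, the use of the first eight bytes of an MD5 digest, and the modulo reduction to a server index are all hypothetical choices.

```python
import hashlib
import random


def hash_to_server(parent_fhandle: bytes, name: str, num_servers: int) -> int:
    """Pick a server deterministically from (parent fhandle, name).

    Assumption: MD5 over the concatenated key, reduced modulo the
    number of directory servers.
    """
    digest = hashlib.md5(parent_fhandle + b"/" + name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def route_mkdir(parent_fhandle, parent_server, name, num_servers, p, rng=random):
    """Return the directory server that should handle a mkdir request.

    With probability p the request is redirected to the server selected
    by hashing; otherwise it stays with the parent directory's server.
    Note: this sketch does not force the hashed server to differ from
    the parent's server.
    """
    if rng.random() < p:
        return hash_to_server(parent_fhandle, name, num_servers)
    return parent_server
```

Because the redirect target depends only on the parent fhandle and the new name, two proxies racing to create the same directory can reach only the parent's server or the one hashed server, which is the "at most two sites" property described above.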


NAME HASHING extends this approach by routing 
all name space operations using a hash on the name 
component and its position in the directory tree, 
as given by the parent directory fhandle. This ap- 
proach represents the entire volume name space as 
a unified global hash table distributed among the 
directory servers. It views directories as distributed 
collections of name entries, rather than as files ac- 
cessed as a unit. Conflicting operations on any given 
name entry (e.g., create/create, create/remove, re- 
move/lookup) always hash to the same server, where 
they serialize on the shared hash chain. Operations 
on different entries in the same directory (e.g., cre- 
ate, remove, lookup) may proceed in parallel at mul- 
tiple sites. For good performance, NAME HASHING 
requires sufficient memory to keep the hash chains 
memory-resident, since the hashing function sacri- 
fices locality in the hash chain accesses. Also, read- 
dir operations span multiple sites; this is the right 
behavior for large directories, but it increases read- 
dir costs for small directories. 
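
A minimal sketch of NAME HASHING routing follows, assuming a hypothetical key encoding (parent fhandle bytes, a separator, and the name) and a modulo reduction to a server index; neither detail is taken from the prototype. It shows the two properties described above: every operation on a given name entry reaches the same server, while the entries of one directory spread across all servers.

```python
import hashlib
from collections import Counter


def name_hash_server(parent_fhandle: bytes, name: str, num_servers: int) -> int:
    """Route a name space operation by hashing (parent fhandle, name)."""
    digest = hashlib.md5(parent_fhandle + b"\x00" + name.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def spread(parent_fhandle: bytes, names, num_servers: int) -> Counter:
    """Count how many entries of one directory land on each server."""
    return Counter(name_hash_server(parent_fhandle, n, num_servers) for n in names)
```

Calling spread on a large directory illustrates why readdir must visit multiple sites under this policy: the entries of a single directory are scattered over the whole ensemble.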


3.3 Storage Service Structure


Request routing policies impact storage service 
structure. The primary challenges are coordination 
and recovery to maintain a consistent view of the 
file volume across all servers, and reconfiguration to 
add or remove servers within each class. 


Most of the routing policies outlined above are in- 
dependent of whether small files and name entries 
are bound to the server sites that create them. One 
option is for the servers to share backing objects 
from a shared disk using a block ownership protocol
(see Section 2.5); in this case, the role of the
µproxy is to enhance locality in the request stream
to each server. Alternatively, the system may use
fixed placement, in which items are controlled by
their create sites unless reconfiguration or failover 
causes them to move; with this approach backing 
storage objects may be private to each site, even if 
they reside on shared network storage. Fixed place- 
ment stresses the role of the request routing pol- 
icy in the placement of new name entries or data 
items. The next two subsections discuss reconfigu- 
ration and recovery issues for the Slice architecture 
with respect to these structural alternatives. 


3.3.1 Reconfiguration 


Consider the problem of reconfiguration to add or
remove file managers, i.e., directory servers,
small-file servers, or map coordinators. For requests
routed by keying on the fileID, the system updates
µproxy routing tables to change the binding from
fileIDs to physical servers if servers join or depart
the ensemble. To keep the tables compact, Slice
maps the fileID to a smaller logical server ID before
indexing the table. The number of logical servers
defines the size of the routing tables and the minimal
granularity for rebalancing. The µproxy’s copy
of the routing table is a “hint” that may become
stale during reconfiguration; the µproxy may load
new tables lazily from an external source, assuming
that servers can identify misdirected requests.
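
The two-level mapping and the lazy refresh of a stale hint can be sketched as follows. The class and function names are hypothetical, the reduction of a fileID to a logical server ID by modulo is an assumption, and the comparison against an authoritative table stands in for a server rejecting a misdirected request.

```python
class RoutingTable:
    """Maps logical server IDs to physical servers (hypothetical layout)."""

    def __init__(self, physical_by_logical):
        self.physical_by_logical = list(physical_by_logical)

    def lookup(self, file_id: int) -> str:
        # Assumed reduction: fileID modulo the number of logical servers.
        logical_id = file_id % len(self.physical_by_logical)
        return self.physical_by_logical[logical_id]


def route(hint: RoutingTable, authoritative: RoutingTable, file_id: int) -> str:
    """Route via the proxy's hint table, reloading it lazily when stale."""
    server = hint.lookup(file_id)
    if server != authoritative.lookup(file_id):
        # Stand-in for a server reporting that the request was misdirected.
        hint.physical_by_logical = list(authoritative.physical_by_logical)
        server = hint.lookup(file_id)
    return server
```

Rebinding one logical server to a new physical server changes a single table slot, which is why the number of logical servers sets the granularity of rebalancing.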


This approach generalizes to policies in which the 
logical server ID is derived from a hash that includes 
other request arguments, as in the NAME HASHING 
approach. For NAME HASHING systems and other 
systems with fixed placement, the reconfiguration 
procedure must move logical servers from one phys- 
ical server to another. One approach is for each 
physical server to use multiple backing objects, one 
for each hosted logical server, and reconfigure by re- 
assigning the binding of physical servers to backing 
objects in the shared network storage array. Other- 
wise, reconfiguration must copy data from one back- 
ing object to another. In general, an ensemble with 
N servers must move 1/Nth of its data to rebalance 
after adding or losing a physical server [15]. 


3.3.2 Atomicity and Recovery 


File systems have strong integrity requirements and 
frequent updates; the system must preserve their in- 
tegrity through failures and concurrent operations. 
The focus on request routing naturally implies that 
the multiple servers must manage distributed state. 


File managers prepare for recovery by generating a 
write-ahead log in shared storage. For systems that 
use the shared-disk model without fixed placement, 
all operations execute at a single manager site, and 
it is necessary and sufficient for the system to pro- 
vide locking and recovery procedures for the shared 
disk blocks [24]. For systems with fixed placement, 
servers do not share blocks directly, but some oper- 
ations must update state at multiple sites through 
a peer-peer protocol. Thus there is no need for dis- 
tributed locking or recovery of individual blocks, but 
the system must coordinate logging and recovery 
across sites, e.g., using two-phase commit.


For MKDIR SWITCHING, the operations that update
multiple sites are those involving the “orphaned” 
directories that were placed on different sites from 
their parents. These operations include the redi- 
rected mkdirs themselves, associated rmdirs, and 
any rename operations involving the orphaned en- 
tries. Since these operations are relatively infre- 
quent, as determined by the redirection probability 
parameter p, it is acceptable to perform a full two- 
phase commit as needed to guarantee their atom- 
icity on systems with fixed placement. However, 
NAME HASHING requires fixed placement (unless
the directory servers support fine-grained
distributed caching), and any name space update
involves multiple sites with probability (N - 1)/N
or higher. While it is possible to reduce commit
costs by logging asynchronously and coordinating 
rollback, this approach weakens failure properties 
because recently completed operations may be lost 
in a failure. 


Shared network storage arrays present their own 
atomicity and recovery challenges. In Slice, the 
block service coordinators preserve atomicity of op- 
erations involving multiple storage nodes, including 
mirrored striping, truncate/remove, and NFS V3 
write commitment (commit). Amiri et al. [1] address
atomicity and concurrency control issues for
shared storage arrays; the Slice coordinator protocol
complements [1] with an intention logging protocol
for atomic filesystem operations [2]. The basic
protocol is as follows. At the start of the operation,
the µproxy sends to the coordinator an intention
to perform the operation. The coordinator logs
the intention to stable storage. When the operation
completes, the proxy notifies the coordinator
with a completion message, asynchronously clearing 
the intention. If the coordinator does not receive 
the completion within some time bound, it probes 
the participants to determine if the operation com- 
pleted, and initiates recovery if necessary. A failed 
coordinator recovers by scanning its intentions log, 
completing or aborting operations in progress at the 
time of the failure. In practice, the protocol elimi- 
nates some message exchanges and log writes from 
the critical path of most common-case operations 
by piggybacking messages, leveraging the NFS V3 
commit semantics, and amortizing intention logging 
costs across multiple operations. 
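
The basic intention logging exchange can be sketched as a small state holder. This is a simplified illustration, not the Slice coordinator: the class and method names are hypothetical, an in-memory dictionary stands in for the log on stable storage, and recovery only classifies each pending operation by asking a caller-supplied probe whether the participants report it as finished.

```python
class Coordinator:
    """Toy intention log: one entry per operation still in progress."""

    def __init__(self):
        self.log = {}  # op_id -> description; stands in for stable storage

    def begin(self, op_id, description):
        """Record the intention before the operation starts."""
        self.log[op_id] = description

    def complete(self, op_id):
        """Clear the intention once the proxy reports completion."""
        self.log.pop(op_id, None)

    def recover(self, probe):
        """Resolve every logged intention after a failure or timeout.

        probe(op_id) returns True if the participants report that the
        operation finished; otherwise the operation is treated as aborted.
        """
        outcome = {}
        for op_id in list(self.log):
            outcome[op_id] = "completed" if probe(op_id) else "aborted"
            del self.log[op_id]
        return outcome
```

Only operations whose completion message never arrived remain in the log, so the recovery scan touches a small set of entries.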


4 Implementation 


The Slice prototype is a set of loadable kernel modules
for the FreeBSD operating system. The prototype
includes a proxy implemented as a packet
filter below the Internet Protocol (IP) stack, and
kernel modules for the basic server classes: block 
storage service and block storage coordinator, di- 
rectory server, and small-file server. A given server 
node may be configured for any subset of the Slice 
server functions, and each function may be present 
at an arbitrary number of nodes. The following sub- 
sections discuss each element of the Slice prototype 
in more detail. 


4.1 The µproxy


The Slice µproxy is a loadable packet filter module
that intercepts packets exchanged with registered
NFS virtual server endpoints. The module is
configurable to run as an intermediary at any point
in the network between a client and the server
ensemble, preserving compatibility with NFS clients.
Our premise is that the functions of the proxy are
simple enough to integrate more tightly with the
network switching elements, enabling wire-speed
request routing. The µproxy may also be configured
below the IP stack on each client node, to avoid the
store-and-forward delays imposed by host-based
intermediaries in our prototype.


The µproxy is a nonblocking state machine with soft
state consisting of pending request records and routing
tables for I/O redirection, MKDIR SWITCHING,
and NAME HASHING, as described in Section 3. The
prototype statically configures the policies and table
sizes for name space operations and small-file
I/O; it does not yet detect and refresh stale routing
tables for reconfiguration. These policies use the
MD5 [22] hash function; we determined empirically
that MD5 yields a combination of balanced distribution
and low cost that is superior to competing
hash functions available to us. For reads and writes
beyond the threshold offset, the µproxy may use either
a static block placement policy or a local cache
of per-file block maps supplied by a block service
coordinator (see Section 4.2).


The µproxy also maintains a cache over file attribute
blocks returned in NFS responses from the
servers. Directory servers maintain the authoritative
attributes for files; the system must keep these
attributes current to reflect I/O traffic to the block
storage nodes, which affects the modify time, access
time, and/or size attributes of the target file.
The µproxy updates these attributes in its cache as
each operation completes, and returns a complete
set of attributes to the client in each response (some
clients depend on this behavior, although the NFS
specification does not require it). The proxy generates
an NFS setattr operation to push modified attributes
back to the directory server when it evicts
attributes from its cache, or when it intercepts an
NFS V3 write commit request from the client. Most
clients issue commit requests for modified files from
a periodic system update daemon, and when a user
process calls fsync or close on a modified file.


The prototype may yield weaker attribute consis- 
tency than some NFS implementations. First, at- 
tribute timestamps are no longer assigned at a cen- 
tral site; we rely on the Network Time Protocol 
(NTP) to keep clocks synchronized across the sys- 
tem. Most NFS installations already use NTP to 
allow consistent assignment and interpretation of 
timestamps across multiple servers and clients. Second,
a read or an uncommitted write is not guaranteed
to update the attribute timestamps if the
µproxy fails and loses its state. In the worst case
an uncommitted write might complete at a stor- 
age node but not affect the modify time at all (if 
the client also fails before reissuing the write). The 
NFS V3 specification permits this behavior: uncom- 
mitted writes may affect any subset of the modified 
data or attributes. Third, although the attribute 
timestamps cached and returned by each µproxy
are always current with respect to operations from 
clients bound to that proxy, they may drift be- 
yond the “three second window” that is the de facto 
standard in NFS implementations for concurrently 
shared files. We consider this to be acceptable since 
NFS V3 offers no firm consistency guarantees for 
concurrently shared files anyway. Note, however, 
that NFS V4 proposes to support consistent file 
sharing through a leasing mechanism similar to
NQ-NFS [17]; it will then be sufficient for the µproxy to
propagate file attributes when a client renews or
relinquishes a lease for the file. The current proxy
bounds the drift by writing back modified attributes 
at regular intervals. 


Since the proxy modifies the contents of request 
and response packets, it must update the UDP or 
TCP checksums to match the new packet data. The 
prototype proxy recomputes checksums incremen- 
tally, generalizing a technique used in other packet 
rewriting systems. The µproxy’s differential checksum
code is derived from the FreeBSD implementation
of Network Address Translation (NAT). The
cost of incremental checksum adjustment is propor- 
tional to the number of modified bytes and is in- 
dependent of the total size of the message. It is 
efficient because the proxy rewrites at most the 
source or destination address and port number, and 
in some cases certain fields of the file attributes. 
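
Incremental checksum adjustment can be illustrated with the update formula from RFC 1624, HC' = ~(~HC + ~m + m'), applied to one rewritten 16-bit word. The sketch below is not the FreeBSD NAT code that the prototype derives from; the function names are hypothetical, and it handles a single 16-bit word per call for clarity.

```python
def ones_add(a: int, b: int) -> int:
    """16-bit one's-complement addition with end-around carry."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)


def checksum(data: bytes) -> int:
    """Full Internet checksum over data (padded to an even length)."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total = ones_add(total, (data[i] << 8) | data[i + 1])
    return ~total & 0xFFFF


def update_checksum(old_sum: int, old_word: int, new_word: int) -> int:
    """Adjust a checksum after one 16-bit word changes (RFC 1624, eqn. 3)."""
    total = ones_add(~old_sum & 0xFFFF, ~old_word & 0xFFFF)
    total = ones_add(total, new_word)
    return ~total & 0xFFFF
```

The update touches only the old checksum and the changed word, so its cost depends on the number of rewritten words rather than on the message length, which matches the cost argument above.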


4.2 Block Storage Service 


The Slice block storage servers use a kernel mod- 
ule that exports disks to the network. The stor- 
age nodes serve a flat space of storage objects 
named by unique identifiers; storage is addressed by 
(object, logicalblock), with physical allocation con- 
trolled by the storage node software as described 
in Section 2.2. The key operations are a subset 
of NFS, including read, write, commit, and remove. 
The storage nodes accept NFS file handles as object 
identifiers, using an external hash to map them to 
storage objects. Our current prototype uses the Fast 
File System (FFS) as a storage manager within each 
storage node. The storage nodes prefetch sequential 
files up to 256 KB beyond the current access, and 
also leverage FFS write clustering. 


The block storage service includes a coordinator im- 
plemented as an extension to the storage node mod- 
ule. Each coordinator manages a set of files, selected 
by fileID. The coordinator maintains optional per- 
file block maps giving the storage site for each logi- 
cal block of the file; these maps are used for dynamic 
I/O routing policies (Section 3.1). The coordinator
also implements the intention logging protocol to 
preserve failure atomicity for file accesses involving 
multiple storage sites (Section 3.3.2), including re- 
move/truncate, consistent write commitment, and 
mirrored writes, as described in [2]. The coordina- 
tor backs its intentions log and block maps within 
the block storage service using a static placement 
function. A more failure-resilient implementation 
would employ redundancy across storage nodes. 


4.3 Directory Servers 


Our directory server implementations use fixed 
placement and support both the NAME HASHING 
and MKDIR SWITCHING policies. The directory 
servers store directory information as webs of linked 
fixed-size cells representing name entries and file at- 
tributes, allocated from memory zones backed by 
the block storage service. These cells are indexed 
by hash chains keyed by an MD5 hash fingerprint 
on the parent file handle and name. The directory 
servers place keys in each newly minted file handle, 
allowing them to locate any resident cell if presented 
with an fhandle or an (fhandle,name) pair. At- 
tribute cells may include a remote key to reference 
an entry on another server, enabling cross-site links 
in the directory structure. Thus the name entries 
and attribute cells for a directory may be distributed 
arbitrarily across the servers, making it possible to 
support both NAME HASHING and MKDIR SWITCHING
policies easily within the same code base.


Figure 2: Small-file server data structures.


Given the distribution of entries across directory 
servers, some NFS operations involve multiple sites. 
The µproxy interacts with a single site for each
request. Directory servers use a simple peer-peer
protocol to update link counts for create/link/remove
and mkdir/rmdir operations that cross sites, and 
to follow cross-site links for lookup, getattr/setattr, 
and readdir. For NAME HASHING we implemented 
rename as a link followed by a remove. 


Support for recovery and reconfiguration is incom- 
plete in our prototype. Directory servers log their 
updates, but the recovery procedure itself is not im- 
plemented, nor is the support for shifting ownership 
of blocks and cells across servers. 


4.4 Small-file Servers 


The small-file server is implemented by a module 
that manages each file as a sequence of 8KB logical 
blocks. Figure 2 illustrates the key data structures 
and their use for a read or write request. The lo- 
cations for each block are given by a per-file map 
record. The server accesses this record by index- 
ing an on-disk map descriptor array using the fileID 
from the fhandle. As in the directory server, storage
for small-file data is allocated from zones backed by
objects in the block storage service.


Each map record gives a fixed number of (off- 
set,length) pairs mapping 8KB file extents to re- 
gions within a backing object. Each logical block 
may have less than the full 8KB of physical space 
allocated for it; physical storage for a block rounds 
the space required up to the next power of two to 
simplify space management. New files or writes to 
empty segments are allocated space according to 
best fit, or if no good fragment is free, a new re- 
gion is allocated at the end of the backing storage 
object. The best-fit variable fragment approach is 
similar to SquidMLA [18]. 


This structure allows efficient space allocation and
supports file growth. For example, an 8300-byte file
would consume only 8320 bytes of physical storage
space: 8192 bytes for the first block, and 128 for the
remaining 108 bytes. Under a create-heavy workload,
the small-file allocation policy lays out data on
backing objects sequentially, batching newly created
files into a single stream for efficient disk writes. The
small-file servers comply with the NFS V3 commit
specification for writes below the threshold offset.
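
The space calculation in the example above can be reproduced with a short sketch. The function names are hypothetical, and the sketch assumes that only the final partial block is rounded up to a power of two and that there is no minimum fragment size; the text does not state either detail.

```python
BLOCK = 8192  # logical block size in bytes (8KB)


def next_pow2(n: int) -> int:
    """Smallest power of two greater than or equal to n (for n >= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def physical_bytes(file_size: int) -> int:
    """Physical space for a file: full blocks plus a rounded-up tail."""
    full_blocks, tail = divmod(file_size, BLOCK)
    return full_blocks * BLOCK + (next_pow2(tail) if tail else 0)
```

For the 8300-byte file, the first block takes 8192 bytes and the 108-byte tail rounds up to 128, so physical_bytes(8300) returns 8320.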


Map records and data from the small-file server 
backing objects are cached in the kernel file buffer 
cache. This structure performs well if file accesses 
and the assignment of fileIDs show good locality. In
particular, if the directory servers assign fileIDs with
good spatial locality, and if files created together 
are accessed together, then the cost of reading the 
map records is amortized across multiple files whose 
records fit in a single block. 


5 Performance 


This section presents experimental results from the 
Slice prototype to show the overheads and scaling 
properties of the interposed request routing archi- 
tecture. We use synthetic benchmarks to stress dif- 
ferent aspects of the system, then evaluate whole- 
system performance using the industry-standard 
SPECsfs97 workload. 


The storage nodes for the test ensemble are Dell 
PowerEdge 4400s with a 733 MHz Pentium-III Xeon 
CPU, 256MB RAM, and a ServerWorks LE chipset. 
Each storage node has eight 18GB Seagate Cheetah 
drives (ST318404LC) connected to a dual-channel 
Ultra-160 SCSI controller. Servers and clients are 
450 MHz Pentium-III PCs with 512MB RAM and
Asus P2B motherboards using a 440BX chipset. 
The machines are linked by a Gigabit Ethernet net- 
work with Alteon ACEnic 710025 adapters and a 
32-port Extreme Summit-7i switch. The switch and 
adapters use 9KB (“Jumbo”) frames; the adapters 
run locally modified firmware that supports header 
splitting for NFS traffic. The adapters occupy a 
64-bit/66 MHz PCI slot on the Dell 4400s, and a 
32-bit/33 MHz PCI slot on the PCs. All kernels are 
built from the same FreeBSD 4.0 source pool. 


Test              single client    saturation

read              62.5 MB/s        437 MB/s
write             38.9 MB/s        479 MB/s
read-mirrored     52.9 MB/s        222 MB/s
write-mirrored    32.2 MB/s        251 MB/s


Table 2: Bulk I/O bandwidth in the test ensemble.


Read/write performance. Table 2 shows raw 
read and write bandwidth for large files. Each test 
(dd) issues read or write system calls on a 1.25 GB 
file in a Slice volume mounted with a 32KB NFS
block size and a read-ahead depth of four blocks.
The µproxies use a static I/O routing function to
stripe large-file data across the storage array. We 
measure sequential access bandwidth for unmirrored 
files and mirrored files with two replicas. 


The left column of Table 2 shows the I/O band- 
width driven by a single PC client. Writes saturate 
the client CPU below 40 MB/s, the maximum band- 
width achievable through the FreeBSD NFS/UDP 
client stack in this configuration. We modified 
the FreeBSD client for zero-copy reading, allowing 
higher bandwidth with lower CPU utilization; in 
this case, performance is limited by a prefetch depth 
bound in FreeBSD. Mirroring degrades read band- 
width because the client proxies alternate between 
the two mirrors to balance the load, leaving some 
prefetched data unused on the storage nodes. Mir- 
roring degrades write bandwidth because the client 
host writes to both mirrors. 


The right column of Table 2 shows the aggregate 
bandwidth delivered to eight clients, saturating the 
storage node I/O systems. Each storage node 
sources reads to the network at 55 MB/s and sinks 
writes at 60 MB/s. While the Cheetah drives each 
yield 33 MB/s of raw bandwidth, achievable disk 
bandwidth is below 75 MB/s per node because the 
4400 backplane has a single SCSI channel for all of 
its internal drive bays, and the FreeBSD 4.0 driver 
runs the channel in Ultra-2 mode because it does 
not yet support Ultra-160. 


Operation                CPU

Packet interception
Packet decode            4.1%
Redirection/rewriting    0.5%
Soft state logic         0.8%


Table 3: µproxy CPU cost for 6250 packets/second.


Overhead of the µproxy. The interposed request
routing architecture is sensitive to the costs
to intercept and redirect file service protocol packets.
Table 3 summarizes the CPU overheads for a
client-based proxy under a synthetic benchmark
that stresses name space operations, which place
the highest per-packet loads on the proxy. The
benchmark repeatedly unpacks (untar) a set of
zero-length files in a directory tree that mimics the
FreeBSD source distribution. Each file create generates
seven NFS operations: lookup, access, create,
getattr, lookup, setattr, setattr. We used iprobe
(Instruction Probe), an on-line profiling tool for
Alpha-based systems, to measure the µproxy CPU cost on
a 500 MHz Compaq 21264 client (4MB L2). This
untar workload generates mixed NFS traffic at a
rate of 3125 request/response pairs per second.


The client spends 6.1% of its CPU cycles in the
µproxy. Redirection replaces the packet destination
and/or ports and restores the checksum as described 
in Section 4.1, consuming a modest 0.5% of CPU 
time. The cost of managing soft state for attribute 
updates and response pairing accounts for 0.8%. 
The most significant cost is the 4.1% of CPU time 
spent decoding the packets to prepare for rewrit- 
ing. Nearly half of the cost is to locate the off- 
sets of the NFS request type and arguments; NFS 
V3 and ONC RPC headers each include variable- 
length fields (e.g., access groups and the NFS V3 file 
handle) that increase the decoding overhead. Mi- 
nor protocol changes could reduce this complexity. 
While this complexity affects the cost to implement 
the proxy in network elements, it does not limit 
the scalability of the Slice architecture.


Figure 3: Directory service scaling.


Directory service scaling. We used the name- 
intensive untar benchmark to evaluate scalability 
of the prototype directory service using the NAME 
HASHING and MKDIR SWITCHING policies. For 
MKDIR SWITCHING we chose p = 1/N, i.e., the
µproxy redirects 1/Nth of the mkdir requests to
distribute the directories across the N server sites. In
this test, a variable number of client processes exe- 
cute the untar benchmark on five client PCs. Each 
process creates 36,000 files and directories gener- 
ating a total of 250,000 NFS operations. For this 
experiment, in which the name space spans many 
directories, MKDIR SWITCHING and NAME HASHING 
perform identically. 


Figure 3 shows the average total latency perceived 
by each client process as a function of the num- 
ber of processes. We use multiple client nodes to 
avoid client saturation, and vary the number of di- 
rectory servers; each line labeled “Slice-N” has N 


4th Symposium on Operating Systems Design and Implementation 


PCs acting as directory servers. For comparison, 
the N-MFS line measures an NFS server exporting 
a memory-based file system (FreeBSD MFS). MFS 
initially performs better due to Slice logging and 
update traffic, but the MFS server’s CPU quickly 
saturates with more clients. In contrast, the Slice 
request routing schemes spread the load among mul- 
tiple directory servers, and both schemes show good 
scaling behavior with more servers. Each server sat- 
urates at 6000 ops/s generating about 0.5 MB/s of 
log traffic.


Figure 4: Impact of affinity for MKDIR SWITCHING.


Figure 4 shows the effect of varying directory affinity
(1 - p) for MKDIR SWITCHING under the
name-intensive untar workload. The X-axis gives the
probability 1 - p that a new directory is placed
on the same server as its parent; the Y-axis shows 
the average untar latency observed by the clients. 
This test uses four client nodes hosting one, four, 
eight, or sixteen client processes against four direc- 
tory servers. For light workloads, latency is unaf- 
fected by affinity, since a single server can handle the 
load. For heavier workloads, increasing directory 
affinity along the X-axis initially yields a slight im- 
provement as the number of cross-server operations 
declines. Increasing affinity toward 100% ultimately 
degrades performance due to load imbalances. This 
simple experiment indicates that MKDIR SWITCH- 
ING can produce even distributions while redirect- 
ing fewer than 20% of directory create requests. A 
more complete study is needed to determine the best 
parameters under a wider range of workloads. 


Overall performance and scalability. We
now report results from SPECsfs97, the
industry-standard benchmark for network-attached storage.
SPECsfs97 runs as a group of workload generator
processes that produce a realistic mix of NFS V3
requests, check the responses against the NFS
standard, and measure latency and delivered throughput
in I/O operations per second (IOPS). SPECsfs is
designed to benchmark servers but not clients; it sends
and receives NFS packets from user space without
exercising the client kernel NFS stack. SPECsfs is a
demanding, industrial-strength, self-scaling
benchmark. We show results as evidence that the
prototype is fully functional, complies with the NFS
V3 standard, and is independent of any client
NFS implementation, and to give a basis for
judging prototype performance and scalability against
commercial-grade servers.


Figure 5: SPECsfs97 throughput at saturation.


The SPECsfs file set is skewed heavily toward small 
files: 94% of files are 64 KB or less. Although small 
files account for only 24% of the total bytes accessed, 
most SPECsfs I/O requests target small files; the 
large files serve to “pollute” the disks. Thus satura- 
tion throughput is determined largely by the num- 
ber of disk arms. The Slice configurations for the 
SPECsfs experiments use a single directory server, 
two small-file servers, and a varying number of stor- 
age nodes. Figures 5 and 6 report results; lines la- 
beled “Slice-N” use N storage nodes. 


Figure 5 gives delivered throughput for SPECsfs97 
in IOPS as a function of offered load. As a baseline, 
the graph shows the 850 IOPS saturation point of 
a single FreeBSD 4.0 NFS server on a Dell 4400 
exporting its disk array as a single volume (us- 
ing the CCD disk concatenator). Slice-1 yields 
higher throughput than the NFS configuration due 
to faster directory operations, but throughput un- 
der load is constrained by the disk arms. The results 
show that Slice throughput scales with larger num- 
bers of storage nodes, up to 6600 IOPS for eight 
storage nodes with a total of 64 disks. 


Figure 6 gives average request latency as a function 
of delivered throughput. Latency jumps are evident
in the Slice results as the ensemble overflows
its 1 GB cache on the small-file servers, but the 
prototype delivers acceptable latency at all work- 
load levels up to saturation. For comparison, we 
include vendor-reported results from spec.org for a 
recent (4Q99) commercial server, the EMC Celerra 
File Server Cluster Model 506. The Celerra 506 uses 
32 Cheetah drives for data and has 4 GB of cache. 
EMC Celerra is an industry-leading product: it de- 
livers better latency and better throughput than the 
Slice prototype in the nearest equivalent configura- 
tion (Slice-4 with 32 drives), as well as better reli- 
ability through its use of RAID with parity. What 
is important is that the interposed request routing 
technique allows Slice to scale to higher IOPS lev- 
els by adding storage nodes and/or file manager 
nodes to the LAN. Celerra and other commercial 
storage servers are also expandable, but the highest 
IOPS ratings are earned by systems using a VOLUME 
PARTITIONING strategy to distribute load within the 
server. For example, this Celerra 506 exports eight 
separate file volumes. The techniques introduced in 
this paper allow high throughputs without imposing 
volume boundaries; all of the Slice configurations 
serve a single unified volume. 


6 Conclusion 


This paper explores interposed request routing in 
Slice, a new architecture for scalable network-attached
storage. Slice interposes a simple redirecting
µproxy along the network path between the
client and an ensemble of storage nodes and file
managers. The proxy virtualizes a client/server file
access protocol (e.g., NFS) by applying configurable 
request routing policies to distribute data and re- 
quests across the ensemble. The ensemble nodes 
cooperate to provide a unified, scalable file service. 


The Slice µproxy distributes requests by request
type and by target object, combining functional
decomposition and data decomposition of the request
traffic. We describe two policies for distributing 
name space requests, MKDIR SWITCHING and NAME 
HASHING, and demonstrate their potential to auto- 
matically distribute name space load across servers. 
These techniques complement simple grouping and 
striping policies to distribute file access load. 


The Slice prototype delivers high bandwidth and 
high request throughput on an industry-standard 
NFS benchmark, demonstrating scalability of the 
architecture and prototype. Experiments with a 
simple µproxy packet filter show the feasibility of
incorporating the request routing features into net- 
work elements. The prototype demonstrates that 
the interposed request routing architecture enables 
incremental construction of powerful distributed 
storage services while preserving compatibility with 
standard file system clients. 


Availability. For more information please visit the 
Web site at http://www.cs.duke.edu/ari/slice. 


Abstract


This paper describes an asynchronous state-machine replication 
system that tolerates Byzantine faults, which can be caused 
by malicious attacks or software errors. Our system is the 
first to recover Byzantine-faulty replicas proactively and it 
performs well because it uses symmetric rather than public- 
key cryptography for authentication. The recovery mechanism 
allows us to tolerate any number of faults over the lifetime of 
the system provided fewer than 1/3 of the replicas become 
faulty within a window of vulnerability that is small under 
normal conditions. The window may increase under a denial- 
of-service attack but we can detect and respond to such 
attacks. The paper presents results of experiments showing 
that overall performance is good and that even a small window 
of vulnerability has little impact on service latency. 


1 Introduction 


This paper describes a new system for asynchronous 
state-machine replication [17, 28] that offers both in- 
tegrity and high availability in the presence of Byzan- 
tine faults. Our system is interesting for two reasons: 
it improves security by recovering replicas proactively, 
and it is based on symmetric cryptography, which allows 
it to perform well so that it can be used in practice to 
implement real services. 

Our system continues to function correctly even when 
some replicas are compromised by an attacker; this 
is worthwhile because the growing reliance on online 
information services makes malicious attacks more likely 
and their consequences more serious. The system also 
survives nondeterministic software bugs and software 
bugs due to aging (e.g., memory leaks). Our approach 
improves on the usual technique of rebooting the system 
because it refreshes state automatically, staggers recovery 
so that individual replicas are highly unlikely to fail 
simultaneously, and has little impact on overall system 
performance. Section 4.7 discusses the types of faults 
tolerated by the system in more detail. 

Because of recovery, our system can tolerate any 
number of faults over the lifetime of the system, provided 
fewer than 1/3 of the replicas become faulty within 
a window of vulnerability. The best that could be
guaranteed previously was correct behavior if fewer 
than 1/3 of the replicas failed during the lifetime of a 
system. Our previous work [6] guaranteed this and other 
systems [26, 16] provided weaker guarantees. Limiting 
the number of failures that can occur in a finite window 
is a synchrony assumption but such an assumption is 
unavoidable: since Byzantine-faulty replicas can discard 
the service state, we must bound the number of failures 
that can occur before recovery completes. But we 
require no synchrony assumptions to match the guarantee 
provided by previous systems. We compare our approach 
with other work in Section 7. 


The window of vulnerability can be small (e.g., a 
few minutes) under normal conditions. Additionally, our 
algorithm provides detection of denial-of-service attacks 
aimed at increasing the window: replicas can time how 
long a recovery takes and alert their administrator if it 
exceeds some pre-established bound. Therefore, integrity 
can be preserved even when there is a denial-of-service 
attack. 

The paper describes a number of new techniques 
needed to solve the problems that arise when providing 
recovery from Byzantine faults: 


Proactive recovery. A Byzantine-faulty replica may 
appear to behave properly even when broken; therefore 
recovery must be proactive to prevent an attacker from 
compromising the service by corrupting 1/3 of the 
replicas without being detected. Our algorithm recovers 
replicas periodically independent of any failure detection 
mechanism. However a recovering replica may not 
be faulty and recovery must not cause it to become 
faulty, since otherwise the number of faulty replicas could 
exceed the bound required to provide safety. In fact, we 
need to allow the replica to continue participating in the 
request processing protocol while it is recovering, since 
this is sometimes required for it to complete the recovery. 


Fresh messages. An attacker must be prevented from 
impersonating a replica that was faulty after it recovers. 
This can happen if the attacker learns the keys used to 
authenticate messages. Furthermore even if messages 
are signed using a secure cryptographic co-processor, 
an attacker might be able to authenticate bad messages 
while it controls a faulty replica; these messages could 
be replayed later to compromise safety. To solve this 
problem, we define a notion of authentication freshness 
and replicas reject messages that are not fresh. However,
this leads to a further problem, since replicas may be 
unable to prove to a third party that some message they 
received is authentic (because it may no longer be fresh). 
All previous state-machine replication algorithms [26, 
16], including the one we described in [6], relied on such 
proofs. Our current algorithm does not, and this has 
the added advantage of enabling the use of symmetric 
cryptography for authentication of all protocol messages. 
This eliminates most use of public-key cryptography, the 
major performance bottleneck in previous systems.


Efficient state transfer. State transfer is harder in the
presence of Byzantine faults and efficiency is crucial to 
enable frequent recovery with little impact on perfor- 
mance. To bring a recovering replica up to date, the state 
transfer mechanism checks the local copy of the state to 
determine which portions are both up-to-date and not cor- 
rupt. Then, it must ensure that any missing state it obtains 
from other replicas is correct. We have developed an effi- 
cient hierarchical state transfer mechanism based on hash 
chaining and incremental cryptography [1]; the mechanism
tolerates Byzantine faults and state modifications
while transfers are in progress. 


Our algorithm has been implemented as a generic 
program library with a simple interface. This library 
can be used to provide Byzantine-fault-tolerant versions 
of different services. The paper describes experiments 
that compare the performance of a replicated NFS imple- 
mented using the library with an unreplicated NFS. The 
results show that the performance of the replicated sys- 
tem without recovery is close to the performance of the 
unreplicated system. They also show that it is possible 
to recover replicas frequently to achieve a small window 
of vulnerability in the normal case (2 to 10 minutes) with 
little impact on service latency. 


The rest of the paper is organized as follows. Sec- 
tion 2 presents our system model and lists our assump- 
tions; Section 3 states the properties provided by our al- 
gorithm; and Section 4 describes the algorithm. Our im- 
plementation is described in Section 5 and some perfor- 
mance experiments are presented in Section 6. Section 7 
discusses related work. Our conclusions are presented in 
Section 8. 


2 System Model and Assumptions 


We assume an asynchronous distributed system where 
nodes are connected by a network. The network may 
fail to deliver messages, delay them, duplicate them, or 
deliver them out of order. 

We use a Byzantine failure model, i.e., faulty nodes
may behave arbitrarily, subject only to the restrictions 
mentioned below. We allow for a very strong adversary 
that can coordinate faulty nodes, delay communication, 
inject messages into the network, or delay correct nodes in 
order to cause the most damage to the replicated service. 
We do assume that the adversary cannot delay correct 
nodes indefinitely. 

We use cryptographic techniques to establish session 
keys, authenticate messages, and produce digests. We use
the SFS [21] implementation of a Rabin-Williams public-key
cryptosystem with a 1024-bit modulus to establish
128-bit session keys. All messages are then authenticated
using message authentication codes (MACs) [2]
computed using these keys. Message digests are computed
using MD5 [27].

We assume that the adversary (and the faulty nodes it 
controls) is computationally bound so that (with very high 
probability) it is unable to subvert these cryptographic 
techniques. For example, the adversary cannot forge 
signatures or MACs without knowing the corresponding 
keys, or find two messages with the same digest. The 
cryptographic techniques we use are thought to have these 
properties. 

Previous Byzantine-fault tolerant state-machine repli- 
cation systems [6, 26, 16] also rely on the assumptions 
described above. We require no additional assumptions 
to match the guarantees provided by these systems, i.e.,
to provide safety if fewer than 1/3 of the replicas become
faulty during the lifetime of the system. To tolerate more 
faults we need additional assumptions: we must mutu- 
ally authenticate a faulty replica that recovers to the other 
replicas, and we need a reliable mechanism to trigger pe- 
riodic recoveries. These could be achieved by involving 
system administrators in the recovery process, but such 
an approach is impractical given our goal of recovering 
replicas frequently. Instead, we rely on the following 
assumptions: 


Secure Cryptography. Each replica has a secure crypto- 
graphic co-processor, e.g., a Dallas Semiconductors iBut- 
ton, or the security chip in the motherboard of the IBM 
PC 300PL. The co-processor stores the replica’s private 
key, and can sign and decrypt messages without exposing 
this key. It also contains a true random number generator, 
e.g., based on thermal noise, and a counter that never goes 
backwards. This enables it to append random numbers 
or the counter to messages it signs. 


Read-Only Memory. Each replica stores the public keys 
for other replicas in some memory that survives failures 
without being corrupted (provided the attacker does not 
have physical access to the machine). This memory could 
be a portion of the flash BIOS. Most motherboards can 
be configured such that it is necessary to have physical 
access to the machine to modify the BIOS. 


Watchdog Timer. Each replica has a watchdog timer 
that periodically interrupts processing and hands control 
to a recovery monitor, which is stored in the read- 
only memory. For this mechanism to be effective, an 
attacker should be unable to change the rate of watchdog 
interrupts without physical access to the machine. Some 
motherboards and extension cards offer the watchdog 
timer functionality but allow the timer to be reset without 
physical access to the machine. However, this is easy to 
fix by preventing write access to control registers unless 
some jumper switch is closed. 

These assumptions are likely to hold when the attacker 
does not have physical access to the replicas, which we 
expect to be the common case. When they fail we can 
fall back on system administrators to perform recovery. 





4th Symposium on Operating Systems Design and Implementation

USENIX Association


Note that all previous proactive security algo- 
rithms [24, 13, 14, 3, 10] assume the entire program run 
by a replica is in read-only memory so that it cannot be 
modified by an attacker. Most also assume that there are 
authenticated channels between the replicas that continue 
to work even after a replica recovers from a compromise. 
These assumptions would be sufficient to implement our 
algorithm but they are less likely to hold in practice. 
We only require a small monitor in read-only memory 
and use the secure co-processors to establish new session 
keys between the replicas after a recovery. 


The only work on proactive security that does not 
assume authenticated channels is [3], but the best that 
a replica can do when its private key is compromised 
in their system is alert an administrator. Our secure 
cryptography assumption enables automatic recovery 
from most failures, and secure co-processors with the 
properties we require are now readily available, e.g., IBM 
is selling PCs with a cryptographic co-processor in the 
motherboard at essentially no added cost. 


We also assume clients have a secure co-processor; 
this simplifies the key exchange protocol between clients 
and replicas but it could be avoided by adding an extra 
round to this protocol. 


3 Algorithm Properties 


Our algorithm is a form of state machine replication [17,
28]: the service is modeled as a state machine that is
replicated across different nodes in a distributed system. 
The algorithm can be used to implement any replicated 
service with a state and some operations. The operations 
are not restricted to simple reads and writes; they can 
perform arbitrary computations. 


The service is implemented by a set of replicas
R and each replica is identified using an integer in
{0, ..., |R| - 1}. Each replica maintains a copy of the
service state and implements the service operations. For
simplicity, we assume |R| = 3f + 1 where f is the
maximum number of replicas that may be faulty. Service
clients and replicas are non-faulty if they follow the 
algorithm and if no attacker can impersonate them (e.g., 
by forging their MACs). 

Like all state machine replication techniques, we 
impose two requirements on replicas: they must start 
in the same state, and they must be deterministic (i.e., the 
execution of an operation in a given state and with a given 
set of arguments must always produce the same result). 
We can handle some common forms of non-determinism
using the technique we described in [6]. 


Our algorithm ensures safety for an execution provided
at most f replicas become faulty within a window
of vulnerability of size T_v. Safety means that the replicated
service satisfies linearizability [12, 5]: it behaves
like a centralized implementation that executes opera- 
tions atomically one at a time. Our algorithm provides 
safety regardless of how many faulty clients are using 
the service (even if they collude with faulty replicas). 


We will discuss the window of vulnerability further in 
Section 4.7. 

The algorithm also guarantees liveness: non-faulty 
clients eventually receive replies to their requests provided
(1) at most f replicas become faulty within the
window of vulnerability T_v; and (2) denial-of-service attacks
do not last forever, i.e., there is some unknown point
in the execution after which all messages are delivered 
(possibly after being retransmitted) within some constant 
time d, or all non-faulty clients have received replies to 
their requests. Here, d is a constant that depends on the 
timeout values used by the algorithm to refresh keys, and 
trigger view-changes and recoveries. 


4 Algorithm 


The algorithm works as follows. Clients send requests 
to execute operations to the replicas and all non-faulty 
replicas execute the same operations in the same order. 
Since replicas are deterministic and start in the same state, 
all non-faulty replicas send replies with identical results 
for each operation. The client waits for f + 1 replies from 
different replicas with the same result. Since at least one 
of these replicas is not faulty, this is the correct result of 
the operation. 
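
The client-side vote can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the library's interface: the name `accept_result` and the representation of replies as `(replica_id, result)` pairs are hypothetical, and authentication of each reply is assumed to have happened before the call.

```python
from collections import Counter


def accept_result(replies, f):
    """Return the result vouched for by f + 1 distinct replicas, or None.

    `replies` is an iterable of (replica_id, result) pairs (hypothetical
    representation). Only the first reply from each replica is counted, so a
    faulty replica cannot vote twice.
    """
    first_reply = {}
    for replica_id, result in replies:
        first_reply.setdefault(replica_id, result)
    votes = Counter(first_reply.values())
    for result, count in votes.most_common():
        if count >= f + 1:
            return result
    return None


# With f = 1, two matching replies from different replicas are enough.
print(accept_result([(0, "ok"), (3, "bad"), (1, "ok")], f=1))  # ok
print(accept_result([(3, "bad"), (3, "bad"), (0, "ok")], f=1))  # None
```

Counting only one vote per replica is what makes the f + 1 threshold meaningful: at most f of the counted senders can be faulty, so at least one matching reply comes from a non-faulty replica.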

The hard problem is guaranteeing that all non-faulty 
replicas agree on a total order for the execution of 
requests despite failures. We use a primary-backup
mechanism to achieve this. In such a mechanism, replicas
move through a succession of configurations called views. 
In a view one replica is the primary and the others are 
backups. We choose the primary of a view to be replica 
p such that p = v mod |R|, where v is the view number 
and views are numbered consecutively. 


The primary picks the ordering for execution of 
operations requested by clients. It does this by assigning 
a sequence number to each request. But the primary may 
be faulty. Therefore, the backups trigger view changes
to select a new primary when it appears that the primary
has failed. Viewstamped Replication [23] and Paxos [18]
use a similar approach to tolerate benign faults. 


To tolerate Byzantine faults, every step taken by a 
node in our system is based on obtaining a certificate. A 
certificate is a set of messages certifying some statement 
is correct and coming from different replicas. An example 
of a statement is: “the result of the operation requested
by a client is r”.

The size of the set of messages in a certificate is either
f + 1 or 2f + 1, depending on the type of statement and
step being taken. The correctness of our system depends 
on a certificate never containing more than f messages 
sent by faulty replicas. A certificate of size f + 1 is 
sufficient to prove that the statement is correct because it 
contains at least one message from a non-faulty replica. 
A certificate of size 2f + 1 ensures that it will also be 
possible to convince other replicas of the validity of the 
statement even when f replicas are faulty. 
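
The counting behind the two certificate sizes can be checked directly. The following Python sketch is illustrative only; the helper names are hypothetical, and the overlap bound is the standard pigeonhole argument rather than anything specific to this system.

```python
def min_correct(certificate_size, f):
    """Fewest non-faulty senders in a certificate when at most f are faulty."""
    return max(0, certificate_size - f)


def min_overlap(n, size_a, size_b):
    """Fewest replicas that two sets of the given sizes must share out of n."""
    return max(0, size_a + size_b - n)


f = 1
n = 3 * f + 1
print(min_correct(f + 1, f))                 # 1
print(min_correct(2 * f + 1, f))             # 2
print(min_overlap(n, 2 * f + 1, 2 * f + 1))  # 2
```

With |R| = 3f + 1, a certificate of size 2f + 1 always contains at least f + 1 non-faulty senders, and any two such certificates share at least f + 1 replicas, hence at least one non-faulty replica.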


Our earlier algorithm [6] used the same basic ideas
but it did not provide recovery. Recovery complicates the
construction of certificates; if a replica collects messages
for a certificate over a sufficiently long period of time
it can end up with more than f messages from faulty
replicas. We avoid this problem by introducing a notion
of freshness; replicas reject messages that are not fresh. 
But this raises another problem: the view change protocol 
in [6] relied on the exchange of certificates between 
replicas and this may be impossible because some of 
the messages in a certificate may no longer be fresh. 
Section 4.5 describes a new view change protocol that 
solves this problem and also eliminates the need for 
expensive public-key cryptography. 

To provide liveness with the new protocol, a replica 
must be able to fetch missing state that may be held by 
a single correct replica whose identity is not known. In 
this case, voting cannot be used to ensure correctness of 
the data being fetched and it is important to prevent a 
faulty replica from causing the transfer of unnecessary 
or corrupt data. Section 4.6 describes a mechanism to 
obtain missing messages and state that addresses these 
issues and that is efficient to enable frequent recoveries. 


The sections below describe our algorithm. Sec- 
tions 4.2 and 4.3, which explain normal-case request pro- 
cessing, are similar to what appeared in [6]. They are 
presented here for completeness and to highlight some 
subtle changes. 


4.1 Message Authentication 


We use MACs to authenticate all messages. There is a 
pair of session keys for each pair of replicas i and j: k_{i,j}
is used to compute MACs for messages sent from i to j,
and k_{j,i} is used for messages sent from j to i.


Some messages in the protocol contain a single MAC
computed using UMAC32 [2]; we denote such a message
as ⟨m⟩_{μ_{ij}}, where i is the sender, j is the receiver, and the
MAC is computed using k_{i,j}. Other messages contain
authenticators; we denote such a message as ⟨m⟩_{α_i},
where i is the sender. An authenticator is a vector of
MACs, one per replica j (j ≠ i), where the MAC in
entry j is computed using k_{i,j}. The receiver of a message
verifies its authenticity by checking the corresponding
MAC in the authenticator.
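
A minimal sketch of an authenticator in Python follows. It substitutes HMAC-SHA256 from the standard library for UMAC32, which the standard library does not provide; the function names, the dictionary layout, and the sample keys are all assumptions made for illustration.

```python
import hashlib
import hmac


def mac(key: bytes, message: bytes) -> bytes:
    """Stand-in MAC: HMAC-SHA256 (the paper uses UMAC32)."""
    return hmac.new(key, message, hashlib.sha256).digest()


def make_authenticator(sender: int, message: bytes, send_keys: dict) -> dict:
    """Build one MAC per receiver j != sender.

    send_keys[j] plays the role of the session key for messages from the
    sender to replica j.
    """
    return {j: mac(key, message) for j, key in send_keys.items() if j != sender}


def verify_entry(receiver: int, message: bytes, authenticator: dict, key: bytes) -> bool:
    """Check only the receiver's own entry, as a receiving replica would."""
    tag = authenticator.get(receiver)
    return tag is not None and hmac.compare_digest(tag, mac(key, message))


# Hypothetical keys that replica 0 uses for messages to replicas 1..3.
keys_from_0 = {1: b"key-0-1", 2: b"key-0-2", 3: b"key-0-3"}
auth = make_authenticator(0, b"PREPARE v=0 n=1", keys_from_0)
print(verify_entry(2, b"PREPARE v=0 n=1", auth, keys_from_0[2]))  # True
print(verify_entry(2, b"PREPARE v=0 n=2", auth, keys_from_0[2]))  # False
```

Each receiver can check only its own entry, which is why a replica cannot use an authenticator to prove a message's authenticity to a third party.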


Replicas and clients refresh the session keys used
to send messages to them by sending new-key messages
periodically (e.g., every minute). The same mechanism is
used to establish the initial session keys. The message has
the form ⟨NEW-KEY, i, ..., {k_{j,i}}_{ε_j}, ..., t⟩_{σ_i}. The message
is signed by the secure co-processor (using the replica’s
private key) and t is the value of its counter; the counter
is incremented by the co-processor and appended to
the message every time it generates a signature. (This
prevents suppress-replay attacks [11].) Each k_{j,i} is the
key replica j should use to authenticate messages it sends
to i in the future; k_{j,i} is encrypted by j’s public key, so
that only j can read it. Replicas use timestamp t to detect
spurious new-key messages: t must be larger than the
timestamp of the last new-key message received from i.
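
The timestamp test on new-key messages amounts to remembering the largest counter value seen from each sender. A small Python sketch, assuming the signature has already been verified and using a hypothetical class name:

```python
class NewKeyFilter:
    """Tracks the largest new-key counter value seen from each sender."""

    def __init__(self):
        self.last_seen = {}

    def accept(self, sender, t):
        """Return True and record t if it is newer than anything seen from sender."""
        if sender in self.last_seen and t <= self.last_seen[sender]:
            return False
        self.last_seen[sender] = t
        return True


flt = NewKeyFilter()
print(flt.accept(1, 5))  # True
print(flt.accept(1, 5))  # False (replay of the same message)
print(flt.accept(1, 6))  # True
```

Because the co-processor's counter never goes backwards, an old new-key message captured by an attacker always carries a smaller value than the most recent genuine one and is rejected.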


Each replica shares a single secret key with each 
client; this key is used for communication in both 
directions. The key is refreshed by the client periodically,
using the new-key message. If a client neglects to do this
within some system-defined period, a replica discards 
its current key for that client, which forces the client to 
refresh the key. 


When a replica or client sends a new-key message, 
it discards all messages in its log that are not part of a 
complete certificate and it rejects any messages it receives 
in the future that are authenticated with old keys. This 
ensures that correct nodes only accept certificates with 
equally fresh messages, i.e., messages authenticated with 
keys created in the same refreshment phase. 


4.2 Processing Requests 


We use a three-phase protocol to atomically multicast 
requests to the replicas. The three phases are pre-prepare,
prepare, and commit. The pre-prepare and prepare phases 
are used to totally order requests sent in the same view 
even when the primary, which proposes the ordering 
of requests, is faulty. The prepare and commit phases 
are used to ensure that requests that commit are totally 
ordered across views. Figure 1 shows the operation of 
the algorithm in the normal case of no primary faults. 


[Figure 1 (diagram not reproduced): message exchange among the client and replicas 0 to 3 across the request, pre-prepare, prepare, commit, and reply phases, with the request status moving from unknown through pre-prepared and prepared to committed.]

Figure 1: Normal Case Operation. Replica 0 is the
primary, and replica 3 is faulty


Each replica stores the service state, a log containing 
information about requests, and an integer denoting the 
replica’s current view. The log records information 
about the request associated with each sequence number, 
including its status; the possibilities are: unknown (the 
initial status), pre-prepared, prepared, and committed. 
Figure 1 also shows the evolution of the request status as 
the protocol progresses. We describe how to truncate the 
log in Section 4.3. 


A client c requests the execution of state machine 
operation o by sending a ⟨REQUEST, o, t, c⟩_{α_c} message to
the primary. Timestamp t is used to ensure exactly-once 
semantics for the execution of client requests [6]. 


When the primary p receives a request m from a client,
it assigns a sequence number n to m. Then it multicasts a
pre-prepare message with the assignment to the backups,
and marks m as pre-prepared with sequence number n.
The message has the form ⟨⟨PRE-PREPARE, v, n, d⟩_{α_p}, m⟩,
where v indicates the view in which the message is being
sent, and d is m’s digest.


Like pre-prepares, the prepare and commit messages 
sent in the other phases also contain n and v. A replica
only accepts one of these messages if it is in view v; it can
verify the authenticity of the message; and n is between
a low water mark, h, and a high water mark, H. The
last condition is necessary to enable garbage collection
and prevent a faulty primary from exhausting the space
of sequence numbers by selecting a very large one. We
discuss how H and h advance in Section 4.3.
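
The three acceptance conditions can be written as a single predicate. In this Python sketch the function names are hypothetical, authentication is reduced to a boolean computed elsewhere, and the choice of an exclusive lower bound and inclusive upper bound (h < n <= H) is an assumption, since the text says only that n lies between the two marks.

```python
def in_window(n, h, H):
    """Assumed bounds: sequence numbers h + 1 .. H are acceptable."""
    return h < n <= H


def accepts(msg_view, n, current_view, h, H, authentic):
    """Combine the view check, the authenticity check, and the water-mark check."""
    return msg_view == current_view and authentic and in_window(n, h, H)


# A replica in view 3 with h = 128 and H = 384.
print(accepts(3, 200, 3, 128, 384, True))     # True
print(accepts(3, 10**9, 3, 128, 384, True))   # False: beyond the high water mark
print(accepts(2, 200, 3, 128, 384, True))     # False: wrong view
```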


A backup i accepts the pre-prepare message provided
(in addition to the conditions above): it has not accepted a
pre-prepare for view v and sequence number n containing
a different digest; it can verify the authenticity of m; and
d is m's digest. If i accepts the pre-prepare, it marks m
as pre-prepared with sequence number n, and enters the
prepare phase by multicasting a ⟨PREPARE, v, n, d, i⟩_{α_i}
message to all other replicas.

When replica i has accepted a certificate with a
pre-prepare message and 2f prepare messages for the
same sequence number n and digest d (each from a
different replica including itself), it marks the message as
prepared. The protocol guarantees that other non-faulty
replicas will either prepare the same request or will not
prepare any request with sequence number n in view v.

Replica i multicasts ⟨COMMIT, v, n, d, i⟩_{α_i} saying it
prepared the request. This starts the commit phase.
When a replica has accepted a certificate with 2f + 1
commit messages for the same sequence number n and
digest d from different replicas (including itself), it marks
the request as committed. The protocol guarantees that
the request is prepared with sequence number n in view
v at f + 1 or more non-faulty replicas. This ensures
information about committed requests is propagated to
new views.
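
The status transitions for one sequence number can be sketched as a small state holder. This Python class is a simplified illustration, not the authors' implementation: the class and method names are hypothetical, messages are assumed to be authenticated and for the current view, and prepares or commits that arrive before the pre-prepare are simply dropped rather than logged.

```python
class LogEntry:
    """Status of one (view, sequence number) slot at a single replica."""

    def __init__(self, f):
        self.f = f
        self.digest = None      # digest from the accepted pre-prepare
        self.prepares = set()   # ids of replicas whose prepare matched
        self.commits = set()    # ids of replicas whose commit matched
        self.status = "unknown"

    def on_pre_prepare(self, digest):
        """Accept the first pre-prepare; reject one with a different digest."""
        if self.digest is None:
            self.digest = digest
            self.status = "pre-prepared"
        return self.digest == digest

    def on_prepare(self, sender, digest):
        if self.digest is not None and digest == self.digest:
            self.prepares.add(sender)
        self._advance()

    def on_commit(self, sender, digest):
        if self.digest is not None and digest == self.digest:
            self.commits.add(sender)
        self._advance()

    def _advance(self):
        # Pre-prepare plus 2f matching prepares from distinct replicas.
        if self.status == "pre-prepared" and len(self.prepares) >= 2 * self.f:
            self.status = "prepared"
        # 2f + 1 matching commits from distinct replicas.
        if self.status == "prepared" and len(self.commits) >= 2 * self.f + 1:
            self.status = "committed"
```

A replica would feed its own prepare and commit messages through the same methods, since the thresholds count the replica itself. Using sets keyed by sender id means a duplicate message from one replica never counts twice.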


Replica i executes the operation requested by the
client when m is committed with sequence number n and 
the replica has executed all requests with lower sequence 
numbers. This ensures that all non-faulty replicas execute 
requests in the same order as required to provide safety. 
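
In-order execution reduces to draining committed requests while the next sequence number is available. A Python sketch, assuming a hypothetical dictionary of committed operations keyed by sequence number and a caller-supplied `apply` callback:

```python
def execute_ready(committed, last_executed, apply):
    """Execute committed operations in sequence-number order without gaps.

    `committed` maps sequence number -> operation and is consumed as
    operations run. Returns the new value of last_executed.
    """
    n = last_executed + 1
    while n in committed:
        apply(committed.pop(n))
        n += 1
    return n - 1


log = []
pending = {1: "op-a", 3: "op-c"}
last = execute_ready(pending, 0, log.append)
print(last, log)  # 1 ['op-a']
pending[2] = "op-b"
last = execute_ready(pending, last, log.append)
print(last, log)  # 3 ['op-a', 'op-b', 'op-c']
```

Request 3 commits early in this example but is held back until request 2 has executed.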


After executing the requested operation, replicas 
send a reply to the client c. The reply has the form 
⟨REPLY, v, t, c, i, r⟩_{μ_{ic}}, where t is the timestamp of the
corresponding request, i is the replica number, and r is the
result of executing the requested operation. This message 
includes the current view number v so that clients can 
track the current primary. 


The client waits for a certificate with f + 1 replies 
from different replicas and with the same t and r, before
accepting the result r. This certificate ensures that the 
result is valid. If the client does not receive replies soon 
enough, it broadcasts the request to all replicas. If the 
request is not executed, the primary will eventually be 
suspected to be faulty by enough replicas to cause a view 
change and select a new primary. 


4.3 Garbage Collection


Replicas can discard entries from their log once the
corresponding requests have been executed by at least
f + 1 non-faulty replicas; this many replicas are needed
to ensure that the execution of that request will be known
after a view change.

We can determine this condition by extra communication,
but to reduce cost we do the communication only
when a request with a sequence number divisible by some
constant K (e.g., K = 128) is executed. We will refer to
the states produced by the execution of these requests as 
checkpoints. 

When replica i produces a checkpoint, it multicasts
a ⟨CHECKPOINT, n, d, i⟩_{α_i} message to the other replicas,
where n is the sequence number of the last request whose 
execution is reflected in the state and d is the digest of 
the state. A replica maintains several logical copies of 
the service state: the current state and some previous 
checkpoints. Section 4.6 describes how we manage 
checkpoints efficiently. 

Each replica waits until it has a certificate containing
2f + 1 valid checkpoint messages for sequence number n
with the same digest d sent by different replicas (including
possibly its own message). At this point, the checkpoint
is said to be stable and the replica discards all entries in
its log with sequence numbers less than or equal to n; it
also discards all earlier checkpoints.

The checkpoint protocol is used to advance the low
and high water marks (which limit what messages will
be added to the log). The low water mark h is equal to
the sequence number of the last stable checkpoint and the
high water mark is H = h + L, where L is the log size.
The log size is obtained by multiplying K by a small
constant factor (e.g., 2) that is big enough so that replicas
do not stall waiting for a checkpoint to become stable.
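
Checkpoint stability and the resulting water-mark movement can be sketched as follows. The class name and its bookkeeping are hypothetical; K = 128 and a log size of 2K are the example values from the text, and checkpoint messages are assumed to be authenticated before they reach the tracker.

```python
K = 128      # checkpoint interval (example value from the text)
L = 2 * K    # log size: K times a small constant factor


class CheckpointTracker:
    """Collects checkpoint messages and advances the low water mark h."""

    def __init__(self, f):
        self.f = f
        self.h = 0
        self.votes = {}  # (n, digest) -> set of sender ids

    @property
    def H(self):
        """High water mark: H = h + L."""
        return self.h + L

    def on_checkpoint(self, sender, n, digest):
        """Record a checkpoint message; return True if n just became stable."""
        if n <= self.h or n % K != 0:
            return False
        senders = self.votes.setdefault((n, digest), set())
        senders.add(sender)
        if len(senders) >= 2 * self.f + 1:
            self.h = n
            # Discard bookkeeping for this and all earlier checkpoints.
            self.votes = {k: v for k, v in self.votes.items() if k[0] > n}
            return True
        return False
```

With f = 1, three matching messages for n = 128 from distinct replicas move h to 128 and H to 384; a message with a different digest for the same n is tallied separately and never contributes to the same certificate.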


4.4 Recovery 


The recovery protocol makes faulty replicas behave 
correctly again to allow the system to tolerate more than 
f faults over its lifetime. To achieve this, the protocol 
ensures that after a replica recovers it is running correct 
code; it cannot be impersonated by an attacker; and it has 
correct, up-to-date state. 


Reboot. Recovery is proactive — it starts periodically 
when the watchdog timer goes off. The recovery monitor 
saves the replica’s state (the log and the service state) 
to disk. Then it reboots the system with correct code 
and restarts the replica from the saved state. The 
correctness of the operating system and service code is 
ensured by storing them in a read-only medium (e.g., the 
Seagate Cheetah 18LP disk can be write protected by 
physically closing a jumper switch). Rebooting restores 
the operating system data structures and removes any 
Trojan horses. 


After this point, the replica's code is correct and it 
did not lose its state. The replica must retain its state 
and use it to process requests even while it is recovering. 
This is vital to ensure both safety and liveness in the 
common case when the recovering replica is not faulty; 
otherwise, recovery could cause the (f + 1)st fault. But
if the recovering replica was faulty, the state may be 
corrupt and the attacker may forge messages because it 
knows the MAC keys used to authenticate both incoming
and outgoing messages. The rest of the recovery protocol 
solves these problems. 


The recovering replica i starts by discarding the keys
it shares with clients and it multicasts a new-key message
to change the keys it uses to authenticate messages sent
by the other replicas. This is important if i was faulty
because otherwise the attacker could prevent a successful
recovery by impersonating any client or replica.


Run estimation protocol. Next, i runs a simple protocol
to estimate an upper bound, H_M, on the high-water mark
that it would have in its log if it were not faulty. It discards
any entries with greater sequence numbers to bound the
sequence number of corrupt entries in the log.

Estimation works as follows: i multicasts a
⟨QUERY-STABLE, i, r⟩_{α_i} message to all the other replicas,
where r is a random nonce. When replica j receives this
message, it replies ⟨REPLY-STABLE, c, p, i, r⟩_{μ_{ji}}, where c
and p are the sequence numbers of the last checkpoint
and the last request prepared at j respectively. Replica i keeps
retransmitting the query message and processing replies;
it keeps the minimum value of c and the maximum value
of p it received from each replica. It also keeps its own
values of c and p.


The recovering replica uses the responses to select
H_M as follows: H_M = L + c_M where L is the log size
and c_M is a value c received from replica j such that 2f
replicas other than j reported values for c less than or
equal to c_M and f replicas other than j reported values
of p greater than or equal to c_M.
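
The selection rule for the estimated high-water mark can be sketched in Python. The function names and the `reports` dictionary (replica id mapped to its reported checkpoint number c and last prepared number p) are hypothetical. The text does not say which qualifying candidate to choose when several exist, so returning the first one in replica-id order is an arbitrary assumption of this sketch.

```python
def qualifies(j, reports, f):
    """True if replica j's reported c has the support the rule requires.

    Needs 2f replicas other than j with c <= c_j, and f replicas other
    than j with p >= c_j.
    """
    c_j = reports[j][0]
    others = [cp for k, cp in reports.items() if k != j]
    c_support = sum(1 for c, _ in others if c <= c_j)
    p_support = sum(1 for _, p in others if p >= c_j)
    return c_support >= 2 * f and p_support >= f


def estimate_high_water_mark(reports, f, log_size):
    """Return log_size + c for the first qualifying candidate, or None."""
    for j in sorted(reports):
        if qualifies(j, reports, f):
            return reports[j][0] + log_size
    return None


# f = 1; replica 3 is faulty and reports an absurdly large checkpoint.
reports = {0: (128, 130), 1: (128, 140), 2: (128, 135), 3: (10**9, 10**9)}
print(qualifies(3, reports, f=1))                           # False
print(estimate_high_water_mark(reports, f=1, log_size=256))  # 384
```

The inflated value from replica 3 fails the test against p, because no other replica reports having prepared a request that far ahead.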


For safety, c_M must be greater than any stable
checkpoint so that i will not discard log entries when
it is not faulty. This is ensured because if a checkpoint
is stable it will have been created by at least f + 1 non-
faulty replicas and it will have a sequence number less
than or equal to any value of c that they propose. The
test against p ensures that c_M is close to a checkpoint
at some non-faulty replica since at least one non-faulty
replica reports a p not less than c_M; this is important
because it prevents a faulty replica from prolonging i’s
recovery. Estimation is live because there are 2f + 1
non-faulty replicas and they only propose a value of c
if the corresponding request committed and that implies
that it prepared at at least f + 1 correct replicas.


After this point i participates in the protocol as if it
were not recovering but it will not send any messages
above H_M until it has a correct stable checkpoint with
sequence number greater than or equal to H_M.


Send recovery request. Next i sends a recovery request
with the form: ⟨REQUEST, ⟨RECOVERY, H_M⟩, t, i⟩_σi.
This message is produced by the cryptographic co-
processor and t is the co-processor's counter to prevent
replays. The other replicas reject the request if it is a re-
play or if they accepted a recovery request from i recently
(where recently can be defined as half of the watchdog 
period). This is important to prevent a denial-of-service 
attack where non-faulty replicas are kept busy executing 
recovery requests. 


4th Symposium on Operating Systems Design and Implementation 


The recovery request is treated like any other request: 
it is assigned a sequence number n_R and it goes through
the usual three phases. But when another replica executes 
the recovery request, it sends its own new-key message. 
Replicas also send a new-key message when they fetch 
missing state (see Section 4.6) and determine that it 
reflects the execution of a new recovery request. This is 
important because these keys are known to the attacker if 
the recovering replica was faulty. By changing these keys, 
we bound the sequence number of messages forged by 
the attacker that may be accepted by the other replicas — 
they are guaranteed not to accept forged messages with 
sequence numbers greater than the maximum high water 
mark in the log when the recovery request executes, i.e., 
H_R = ⌊n_R/K⌋ × K + L.

The reply to the recovery request includes the sequence
number n_R. Replica i uses the same protocol
as the client to collect the correct reply to its recovery
request but waits for 2f + 1 replies. Then it computes its
recovery point, H = max(H_M, H_R). It also computes
a valid view (see Section 4.5); it retains its current view
if there are f + 1 replies for views greater than or equal
to it, else it changes to the median of the views in the
replies.
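
As a small worked sketch of these two computations (function names and argument layout are hypothetical, and using the lower median for an even number of replies is an assumption):

```python
import statistics

def recovery_point(h_m, n_r, k, log_size):
    """H = max(H_M, H_R) with H_R = floor(n_R / K) * K + L."""
    h_r = (n_r // k) * k + log_size
    return max(h_m, h_r)

def choose_view(current_view, reply_views, f):
    """Keep the current view if f + 1 replies report a view >= it,
    otherwise adopt the median of the reported views."""
    support = sum(1 for v in reply_views if v >= current_view)
    if support >= f + 1:
        return current_view
    # median_low always returns one of the reported views.
    return statistics.median_low(reply_views)
```

For example, with K = 128, L = 256, and n_R = 300, H_R is 2 × 128 + 256 = 512, so a smaller estimate H_M = 356 is overridden.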

Check and fetch state. While i is recovering, it uses
the state transfer mechanism discussed in Section 4.6 to 
determine what pages of the state are corrupt and to fetch 
pages that are out-of-date or corrupt. 


Replica i is recovered when the checkpoint with
sequence number H is stable. This ensures that any
state other replicas relied on i to have is actually held
by f + 1 non-faulty replicas. Therefore if some other
replica fails now, we can be sure the state of the system
will not be lost. This is true because the estimation
procedure run at the beginning of recovery ensures that
while recovering i never sends bad messages for sequence
numbers above the recovery point. Furthermore, the 
recovery request ensures that other replicas will not 
accept forged messages with sequence numbers greater 
than H. 


Our protocol has the nice property that any replica
knows that i has completed its recovery when checkpoint
H is stable. This allows replicas to estimate the duration
of i's recovery, which is useful to detect denial-of-service
attacks that slow down recovery with low false positives.


4.5 View Change Protocol 


The view change protocol provides liveness by allowing 
the system to make progress when the current primary 
fails. The protocol must preserve safety: it must ensure 
that non-faulty replicas agree on the sequence numbers 
of committed requests across views. In addition, the 
protocol must provide liveness: it must ensure that non- 
faulty replicas stay in the same view long enough for the 
system to make progress, even in the face of a denial-of- 
service attack. 

The new view change protocol uses the techniques 
described in [6] to address liveness but uses a different 
approach to preserve safety. Our earlier approach relied 
on certificates that were valid indefinitely. In the new
protocol, however, the fact that messages can become 
stale means that a replica cannot prove the validity of a 
certificate to others. Instead the new protocol relies on 
the group of replicas to validate each statement that some 
replica claims has a certificate. The rest of this section 
describes the new protocol. 

Data structures. Replicas record information about what 
happened in earlier views. This information is maintained
in two sets, the PSet and the QSet. A replica also 
stores the requests corresponding to the entries in these 
sets. These sets only contain information for sequence 
numbers between the current low and high water marks 
in the log; therefore only limited storage is required. The 
sets allow the view change protocol to work properly 
even when more than one view change occurs before the 
system is able to continue normal operation; the sets are 
usually empty while the system is running normally. 


The PSet at replica i stores information about requests
that have prepared at i in previous views. Its entries
are tuples (n, d, v) meaning that a request with digest d
prepared at i with number n in view v and no request
prepared at i in a later view.

The QSet stores information about requests that have
pre-prepared at i in previous views (i.e., requests for
which i has sent a pre-prepare or prepare message). Its
entries are tuples (n, {..., (d_k, v_k), ...}) meaning that for
each k, v_k is the latest view in which a request pre-
prepared with sequence number n and digest d_k at i.


View-change messages. View changes are triggered
when the current primary is suspected to be faulty (e.g.,
when a request from a client is not executed after some
period of time; see [6] for details). When a backup i
suspects the primary for view v is faulty, it enters view v + 1
and multicasts a ⟨VIEW-CHANGE, v + 1, ls, C, P, Q, i⟩_αi
message to all replicas. Here ls is the sequence number
of the latest stable checkpoint known to i; C is a set
of pairs with the sequence number and digest of each
checkpoint stored at i; and P and Q are sets containing
a tuple for every request that is prepared or pre-prepared,
respectively, at i. These sets are computed using the
information in the log, the PSet, and the QSet, as
explained in Figure 2. Once the view-change message
has been sent, i stores P in PSet, Q in QSet, and clears
its log. The computation bounds the size of each tuple in
QSet; it retains only pairs corresponding to f + 2 distinct
requests (corresponding to possibly f messages from
faulty replicas, one message from a good replica, and
one special null message as explained below). Therefore
the amount of storage used is bounded.

View-change-ack messages. Replicas collect view-
change messages for v + 1 and send acknowledgments for
them to v + 1's primary, p. The acknowledgments have the
form ⟨VIEW-CHANGE-ACK, v + 1, i, j, d⟩_μip, where i is the
identifier of the sender, d is the digest of the view-change
message being acknowledged, and j is the replica that
sent that view-change message. These acknowledgments
allow the primary to prove authenticity of view-change
messages sent by faulty replicas as explained later.


let v be the view before the view change, L be the size of
the log, and h be the log's low water mark

for all n such that h < n ≤ h + L do
    if request number n with digest d is prepared or
       committed in view v then
        add (n, d, v) to P
    else if ∃ (n, d', v') ∈ PSet then
        add (n, d', v') to P
    if request number n with digest d is pre-prepared,
       prepared or committed in view v then
        if ¬∃ (n, D) ∈ QSet then
            add (n, {(d, v)}) to Q
        else if ∃ (d, v') ∈ D then
            add (n, D ∪ {(d, v)} − {(d, v')}) to Q
        else
            if |D| > f + 1 then
                remove entry with lowest view number from D
            add (n, D ∪ {(d, v)}) to Q
    else if ∃ (n, D) ∈ QSet then
        add (n, D) to Q


Figure 2: Computing P and Q 
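
The computation in Figure 2 can be rendered as a short Python sketch. The data layout is an assumption made for illustration: the log maps a sequence number to a (digest, status) pair for the current view, PSet maps a sequence number to (digest, view), and QSet maps a sequence number to a dictionary from digest to latest view.

```python
def compute_p_q(v, h, log_size, f, log, pset, qset):
    """Compute the P and Q sets sent in a view-change message.

    log:  {n: (digest, status)}, status in
          {"pre-prepared", "prepared", "committed"} for view v
    pset: {n: (digest, view)}
    qset: {n: {digest: view}}
    """
    P, Q = {}, {}
    for n in range(h + 1, h + log_size + 1):
        entry = log.get(n)
        if entry and entry[1] in ("prepared", "committed"):
            P[n] = (entry[0], v)
        elif n in pset:
            P[n] = pset[n]
        if entry:  # pre-prepared, prepared or committed in view v
            d = entry[0]
            D = dict(qset.get(n, {}))
            if d not in D and len(D) > f + 1:
                # Bound the tuple size: evict the pair with the lowest view.
                del D[min(D, key=D.get)]
            D[d] = v  # adds (d, v), replacing any older (d, v')
            Q[n] = D
        elif n in qset:
            Q[n] = dict(qset[n])
    return P, Q
```

The eviction step keeps each entry of Q at no more than f + 2 pairs, which matches the storage bound stated in the text.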


New-view message construction. The new primary
p collects view-change and view-change-ack messages
(including messages from itself). It stores view-change
messages in a set S. It adds a view-change message
received from replica i to S after receiving 2f − 1 view-
change-acks for i's view-change message from other
replicas. Each entry in S is for a different replica.
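
A minimal sketch of this bookkeeping follows. The class and method names are hypothetical, and two details are assumptions: acknowledgments from the sender itself or from the primary are not counted, and the primary's own view-change message is admitted without acknowledgments.

```python
from collections import defaultdict

class ViewChangeSet:
    """Tracks which view-change messages the new primary may place in S."""

    def __init__(self, f, primary):
        self.f = f
        self.primary = primary
        self.pending = {}             # sender -> digest of its view-change
        self.acks = defaultdict(set)  # (sender, digest) -> ids of ackers
        self.S = {}                   # sender -> digest, one entry per replica

    def on_view_change(self, sender, digest):
        self.pending.setdefault(sender, digest)
        self._promote(sender)

    def on_ack(self, acker, sender, digest):
        if acker not in (sender, self.primary):
            self.acks[(sender, digest)].add(acker)
        self._promote(sender)

    def _promote(self, sender):
        digest = self.pending.get(sender)
        if digest is None or sender in self.S:
            return
        enough = len(self.acks[(sender, digest)]) >= 2 * self.f - 1
        if sender == self.primary or enough:
            self.S[sender] = digest
```

Counting the 2f − 1 acknowledging replicas together with the primary and the sender gives 2f + 1 replicas, of which at least f + 1 are non-faulty.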


let D = {(n, d) | ∃ 2f + 1 messages m ∈ S : m.ls ≤ n
                ∧ ∃ f + 1 messages m ∈ S : (n, d) ∈ m.C}
if ∃ (h, d) ∈ D : ∀ (n', d') ∈ D : n' ≤ h then
    select checkpoint with digest d and number h
else exit
for all n such that h < n ≤ h + L do
    A. if ∃ m ∈ S with (n, d, v) ∈ m.P that verifies:
        A1. ∃ 2f + 1 messages m' ∈ S :
              m'.ls < n ∧ m'.P has no entry for n or
              ∃ (n, d', v') ∈ m'.P : v' < v ∨ (v' = v ∧ d' = d)
        A2. ∃ f + 1 messages m' ∈ S :
              ∃ (n, {..., (d', v'), ...}) ∈ m'.Q : v' ≥ v ∧ d' = d
        A3. the primary has the request with digest d
       then select the request with digest d for number n
    B. else if ∃ 2f + 1 messages m ∈ S such that
          m.ls < n ∧ m.P has no entry for n
       then select the null request for number n


Figure 3: Decision procedure at the primary. 


The new primary uses the information in S and the 
decision procedure sketched in Figure 3 to choose a 
checkpoint and a set of requests. This procedure runs 
each time the primary receives new information, e.g., 
when it adds a new message to S. 

The primary starts by selecting the checkpoint that is
going to be the starting state for request processing in
the new view. It picks the checkpoint with the highest
number h from the set of checkpoints that are known
to be correct and that have numbers higher than the low
water mark in the log of at least f + 1 non-faulty replicas.
The last condition is necessary for safety; it ensures that
the ordering information for requests that committed with
numbers higher than h is still available.
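
The checkpoint choice can be sketched as follows, assuming each view-change message in S is a dictionary with an "ls" field and a set "C" of (number, digest) pairs. These names and the layout are illustrative, and the sketch does not address ties between different digests at the same number.

```python
def select_checkpoint(S, f):
    """Return the (h, d) pair chosen by the first step of Figure 3,
    or None if no checkpoint has enough support yet."""
    candidates = set()
    for m in S:
        candidates |= set(m["C"])
    D = set()
    for n, d in candidates:
        # 2f + 1 messages whose latest stable checkpoint is at or below n
        low_enough = sum(1 for m in S if m["ls"] <= n)
        # f + 1 messages that vouch for checkpoint (n, d)
        vouching = sum(1 for m in S if (n, d) in m["C"])
        if low_enough >= 2 * f + 1 and vouching >= f + 1:
            D.add((n, d))
    if not D:
        return None  # "exit": wait for more view-change messages
    return max(D, key=lambda pair: pair[0])
```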


Next, the primary selects a request to pre-prepare in
the new view for each sequence number between h and
h + L (where L is the size of the log). For each number
n that was assigned to some request m that committed
in a previous view, the decision procedure selects m to
pre-prepare in the new view with the same number. This
ensures safety because no distinct request can commit
with that number in the new view. For other numbers, the
primary may pre-prepare a request that was in progress
but had not yet committed, or it might select a special
null request that goes through the protocol as a regular
request but whose execution is a no-op.

We now argue informally that this procedure will
select the correct value for each sequence number. If
a request m committed at some non-faulty replica with
number n, it prepared at at least f + 1 non-faulty replicas
and the view-change messages sent by those replicas will
indicate that m prepared with number n. Any set of at
least 2f + 1 view-change messages for the new view must
include a message from one of the non-faulty replicas that
prepared m. Therefore, the primary for the new view
will be unable to select a different request for number n
because no other request will be able to satisfy conditions
A1 or B (in Figure 3).

The primary will also be able to make the right de-
cision eventually: condition A1 will be satisfied because
there are 2f + 1 non-faulty replicas and non-faulty repli-
cas never prepare different requests for the same view
and sequence number; A2 is also satisfied since a request
that prepares at a non-faulty replica pre-prepares at at
least f + 1 non-faulty replicas. Condition A3 may not be
satisfied initially, but the primary will eventually receive
the request in a response to its status messages (discussed
in Section 4.6). When a missing request arrives, this will
trigger the decision procedure to run.
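
Conditions A1 to A3 and B for a single sequence number can be sketched as below. The message layout (fields "ls", "P", "Q"), the NULL_REQUEST marker, and the have_request callback are hypothetical; the sketch also assumes that the test m'.ls < n in A1 applies to both alternatives of the disjunction and that view numbers are non-negative.

```python
NULL_REQUEST = "null"

def select_request(n, S, f, have_request):
    """Return the digest selected for number n, NULL_REQUEST, or None
    if the primary cannot decide yet.

    Each message in S: {"ls": int, "P": {n: (d, v)}, "Q": {n: {d: v}}}
    """
    quorum, weak = 2 * f + 1, f + 1
    candidates = sorted({m["P"][n] for m in S if n in m["P"]})
    for d, v in candidates:
        a1 = sum(
            1 for m in S
            if m["ls"] < n and (
                n not in m["P"]
                or m["P"][n][1] < v
                or m["P"][n] == (d, v)
            )
        )
        a2 = sum(1 for m in S if m["Q"].get(n, {}).get(d, -1) >= v)
        if a1 >= quorum and a2 >= weak and have_request(d):  # A1, A2, A3
            return d
    b = sum(1 for m in S if m["ls"] < n and n not in m["P"])
    return NULL_REQUEST if b >= quorum else None
```

Returning None models the case where the procedure has to wait, for example until a missing request arrives and condition A3 can be satisfied.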


The decision procedure ends when the primary has se-
lected a request for each number. This takes O(L × |R|³)
local steps in the worst case but the normal case is much
faster because most replicas propose identical values. Af-
ter deciding, the primary multicasts a new-view message
to the other replicas with its decision. The new-view
message has the form ⟨NEW-VIEW, v + 1, V, X⟩_αp. Here,
V contains a pair for each entry in S consisting of the
identifier of the sending replica and the digest of its view-
change message, and X identifies the checkpoint and
request values selected.


New-view message processing. The primary updates its
state to reflect the information in the new-view message.
It records all requests in X as pre-prepared in view v + 1
in its log. If it does not have the checkpoint with sequence
number h it also initiates the protocol to fetch the missing
state (see Section 4.6.2). In any case the primary does not
accept any prepare or commit messages with sequence
number less than or equal to h and does not send any
pre-prepare message with such a sequence number.

The backups for view v + 1 collect messages until they
have a correct new-view message and a correct matching
view-change message for each pair in V. If some replica
changes its keys in the middle of a view change, it has to
discard all the view-change protocol messages it already
received with the old keys. The message retransmission
mechanism causes the other replicas to re-send these
messages using the new keys.


If a backup did not receive one of the view-change
messages for some replica with a pair in V, the primary
alone may be unable to prove that the message it received
is authentic because it is not signed. The use of view-
change-ack messages solves this problem. The primary
only includes a pair for a view-change message in S after
it collects 2f − 1 matching view-change-ack messages
from other replicas. This ensures that at least f + 1 non-
faulty replicas can vouch for the authenticity of every
view-change message whose digest is in V. Therefore, if
the original sender of a view-change is uncooperative, the
primary retransmits that sender's view-change message
and the non-faulty backups retransmit their view-change-
acks. A backup can accept a view-change message whose
authenticator is incorrect if it receives f view-change-
acks that match the digest and identifier in V.


After obtaining the new-view message and the match- 
ing view-change messages, the backups check whether 
these messages support the decisions reported by the pri- 
mary by carrying out the decision procedure in Figure 3. 
If they do not, the replicas move immediately to view 
v + 2. Otherwise, they modify their state to account for
the new information in a way similar to the primary. The 
only difference is that they multicast a prepare message 
for v + 1 for each request they mark as pre-prepared. 
Thereafter, the protocol proceeds as described in Sec- 
tion 4.2. 


The replicas use the status mechanism in Section 4.6 
to request retransmission of missing requests as well 
as missing view-change, view-change acknowledgment, 
and new-view messages. 


4.6 Obtaining Missing Information 


This section describes the mechanisms for message 
retransmission and state transfer. The state transfer 
mechanism is necessary to bring replicas up to date when 
some of the messages they are missing were garbage 
collected. 


4.6.1 Message Retransmission 


We use a receiver-based recovery mechanism similar to
SRM [8]: a replica i multicasts small status messages
that summarize its state; when other replicas receive a
status message they retransmit messages they have sent
in the past that i is missing. Status messages are sent
periodically and when the replica detects that it is missing
information (i.e., they also function as negative acks).

If a replica j is unable to validate a status message, it
sends its last new-key message to i. Otherwise, j sends
messages it sent in the past that i may be missing. For
example, if i is in a view less than j's, j sends i its
latest view-change message. In all cases, j authenticates
messages it retransmits with the latest keys it received in
a new-key message from i. This is important to ensure
liveness with frequent key changes.

Clients retransmit requests to replicas until they re- 
ceive enough replies. They measure response times to 
compute the retransmission timeout and use a random- 
ized exponential backoff if they fail to receive a reply 
within the computed timeout. 
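
The text does not give the formulas clients use, so the following is only one plausible sketch of the idea: a smoothed response-time estimate yields the retransmission timeout, and each retry waits a random delay whose upper bound doubles. The gain, the factor, and both function names are assumptions.

```python
import random

def update_timeout(estimate, sample, gain=0.125, factor=2.0):
    """Fold one measured response time into the running estimate and
    return (new_estimate, retransmission_timeout)."""
    new_estimate = (1 - gain) * estimate + gain * sample
    return new_estimate, factor * new_estimate

def backoff_delay(timeout, attempt, rng=random):
    """Randomized exponential backoff: the delay before retransmission
    number `attempt` (0-based) is drawn from [timeout, timeout * 2**attempt]."""
    ceiling = timeout * (2 ** attempt)
    return rng.uniform(timeout, ceiling)
```

Randomizing the delay keeps clients that timed out together from retransmitting in lockstep.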


4.6.2 State Transfer 


A replica may learn about a stable checkpoint beyond 
the high water mark in its log by receiving checkpoint 
messages or as the result of a view change. In this case, it 
uses the state transfer mechanism to fetch modifications 
to the service state that it is missing. 


It is important for the state transfer mechanism to 
be efficient because it is used to bring a replica up to 
date during recovery, and we perform proactive recover- 
ies frequently. The key issues to achieving efficiency are 
reducing the amount of information transferred and re- 
ducing the burden imposed on replicas. This mechanism 
must also ensure that the transferred state is correct. We 
start by describing our data structures and then explain 
how they are used by the state transfer mechanism.


Data Structures. We use hierarchical state partitions 
to reduce the amount of information transferred. The 
root partition corresponds to the entire service state 
and each non-leaf partition is divided into s equal- 
sized, contiguous sub-partitions. We call leaf partitions
pages and interior partitions meta-data. For example, 
the experiments described in Section 6 were run with a 
hierarchy with four levels, s equal to 256, and 4KB pages. 

Each replica maintains one logical copy of the parti-
tion tree for each checkpoint. The copy is created when
the checkpoint is taken and it is discarded when a later
checkpoint becomes stable. The tree for a checkpoint
stores a tuple (lm, d) for each meta-data partition and a
tuple (lm, d, p) for each page. Here, lm is the sequence
number of the checkpoint at the end of the last checkpoint
interval where the partition was modified, d is the digest
of the partition, and p is the value of the page.

The digests are computed efficiently as follows. For
a page, d is obtained by applying the MD5 hash func-
tion [27] to the string obtained by concatenating the in-
dex of the page within the state, its value of lm, and p.
For meta-data, d is obtained by applying MD5 to the
string obtained by concatenating the index of the parti-
tion within its level, its value of lm, and the sum modulo a
large integer of the digests of its sub-partitions. Thus, we
apply AdHash [1] at each meta-data level. This construc-
tion has the advantage that the digests for a checkpoint
can be obtained efficiently by updating the digests from
the previous checkpoint incrementally.
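
The incremental property can be illustrated with a short sketch. The byte encodings, the fixed-width fields, and the modulus 2^256 are assumptions chosen for the example; the text only says that the sum is taken modulo a large integer.

```python
import hashlib

MODULUS = 2 ** 256  # assumed "large integer"

def page_digest(index, lm, value):
    """MD5 over the page index, its lm, and the page value."""
    data = index.to_bytes(8, "big") + lm.to_bytes(8, "big") + value
    return int.from_bytes(hashlib.md5(data).digest(), "big")

def meta_digest(index, lm, child_sum):
    """MD5 over the partition index, its lm, and the modular sum
    of the digests of its sub-partitions."""
    data = (index.to_bytes(8, "big") + lm.to_bytes(8, "big")
            + child_sum.to_bytes(32, "big"))
    return int.from_bytes(hashlib.md5(data).digest(), "big")

def update_child_sum(child_sum, old_digest, new_digest):
    """Incremental update: replace one sub-partition digest in the sum
    without rehashing the unchanged siblings."""
    return (child_sum - old_digest + new_digest) % MODULUS
```

When one page changes, only its own digest and one subtraction and addition per ancestor level are needed, rather than a pass over all 256 siblings at each level.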

The copies of the partition tree are logical because
we use copy-on-write so that only copies of the tuples
modified since the checkpoint was taken are stored. This
reduces the space and time overheads for maintaining
these checkpoints significantly.


Fetching State. The strategy to fetch state is to recurse 
down the hierarchy to determine which partitions are out 
of date. This reduces the amount of information about 
(both non-leaf and leaf) partitions that needs to be fetched. 

A replica i multicasts ⟨FETCH, l, x, lc, c, k, i⟩_αi to all
replicas to obtain information for the partition with index
x in level l of the tree. Here, lc is the sequence number
of the last checkpoint i knows for the partition, and c is
either -1 or it specifies that i is seeking the value of the
partition at sequence number c from replica k.


When a replica i determines that it needs to initiate
a state transfer, it multicasts a fetch message for the root
partition with lc equal to its last checkpoint. The value
of c is non-zero when i knows the correct digest of the
partition information at checkpoint c, e.g., after a view
change completes i knows the digest of the checkpoint
that propagated to the new view but might not have it. i
also creates a new (logical) copy of the tree to store the
state it fetches and initializes a table LC in which it stores
the number of the latest checkpoint reflected in the state
of each partition in the new tree. Initially each entry in
the table will contain lc.

If ⟨FETCH, l, x, lc, c, k, i⟩_αi is received by the desig-
nated replier, k, and it has a checkpoint for sequence
number c, it sends back ⟨META-DATA, c, l, x, P, k⟩, where
P is a set with a tuple (x', lm, d) for each sub-partition
of (l, x) with index x', digest d, and lm > lc. Since i
knows the correct digest for the partition value at check-
point c, it can verify the correctness of the reply without
the need for voting or even authentication. This reduces
the burden imposed on other replicas.


The other replicas only reply to the fetch message if
they have a stable checkpoint greater than lc and c. Their
replies are similar to k's except that c is replaced by
the sequence number of their stable checkpoint and the
message contains a MAC. These replies are necessary
to guarantee progress when replicas have discarded a
specific checkpoint requested by i.


Replica i retransmits the fetch message (choosing a
different k each time) until it receives a valid reply from
some k or f + 1 equally fresh responses with the same sub-
partition values for the same sequence number cp (greater
than lc and c). Then, it compares its digests for each sub-
partition of (l, x) with those in the fetched information; it
multicasts a fetch message for sub-partitions where there
is a difference, and sets the value in LC to c (or cp) for
the sub-partitions that are up to date. Since i learns the
correct digest of each sub-partition at checkpoint c (or
cp) it can use the optimized protocol to fetch them.

The protocol recurses down the tree until i sends
fetch messages for out-of-date pages. Pages are fetched
like other partitions except that meta-data replies contain
the digest and last modification sequence number for the
page rather than sub-partitions, and the designated replier
sends back ⟨DATA, x, p⟩. Here, x is the page index and p
is the page value. The protocol imposes little overhead
on other replicas; only one replica replies with the full
page and it does not even need to compute a MAC for
the message since i can verify the reply using the digest
it already knows.

When i obtains the new value for a page, it updates
the state of the page, its digest, the value of the last modi-
fication sequence number, and the value corresponding to
the page in LC. Then, the protocol goes up to its parent
and fetches another missing sibling. After fetching all
the siblings, it checks if the parent partition is consistent.
A partition is consistent up to sequence number c if c
is the minimum of all the sequence numbers in LC for
its sub-partitions, and c is greater than or equal to the
maximum of the last modification sequence numbers in
its sub-partitions. If the parent partition is not consistent,
the protocol sends another fetch for the partition. Other-
wise, the protocol goes up again to its parent and fetches
missing siblings.

The protocol ends when it visits the root partition 
and determines that it is consistent for some sequence 
number c. Then the replica can start processing requests 
with sequence numbers greater than c. 
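
The consistency test applied at each parent partition is small enough to state directly. In this sketch each sub-partition is represented by a hypothetical (lc, lm) pair: its entry in the LC table and its last modification sequence number.

```python
def consistent_up_to(children):
    """Return the sequence number c up to which the parent partition is
    consistent, or None if another fetch is needed.

    children: list of (lc, lm) pairs, one per sub-partition.
    """
    c = min(lc for lc, _ in children)
    if c >= max(lm for _, lm in children):
        return c
    return None
```

A sub-partition modified after the oldest checkpoint reflected among its siblings makes the parent inconsistent, which is what triggers the extra fetch described above.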


Since state transfer happens concurrently with request 
execution at other replicas and other replicas are free to 
garbage collect checkpoints, it may take some time for a 
replica to complete the protocol, e.g., each time it fetches 
a missing partition, it receives information about yet a 
later modification. This is unlikely to be a problem in 
practice (this intuition is confirmed by our experimental 
results). Furthermore, if the replica fetching the state ever 
is actually needed because others have failed, the system 
will wait for it to catch up. 


4.7 Discussion 


Our system ensures safety and liveness for an execution
τ provided at most f replicas become faulty within a
window of vulnerability of size T_v = 2T_k + T_r. The
values of T_k and T_r are characteristic of each execution
τ and unknown to the algorithm. T_k is the maximum
key refreshment period in τ for a non-faulty node, and T_r
is the maximum time between when a replica fails and
when it recovers from that fault in τ.

The message authentication mechanism from Sec-
tion 4.1 ensures non-faulty nodes only accept certificates
with messages generated within an interval of size at
most 2T_k.¹ The bound on the number of faults within
T_v ensures there are never more than f faulty replicas
within any interval of size at most 2T_k. Therefore, safety
and liveness are provided because non-faulty nodes never
accept certificates with more than f bad messages.


We have little control over the value of T_r because
T_r may be increased by a denial-of-service attack, but
we have good control over T_k and the maximum time
between watchdog timeouts, T_w, because their values
are determined by timer rates, which are quite stable.
Setting these timeout values involves a tradeoff between
security and performance: small values improve security
by reducing the window of vulnerability but degrade
performance by causing more frequent recoveries and
key changes. Section 6 analyzes this tradeoff.

¹ It would be T_k except that during view changes replicas may accept
messages that are claimed authentic by f + 1 replicas without directly
checking their authentication token.

The value of T_w should be set based on R_n, the time
it takes to recover a non-faulty replica under normal load
conditions. There is no point in recovering a replica
when its previous recovery has not yet finished; and we
stagger the recoveries so that no more than f replicas
are recovering at once, since otherwise service could be
interrupted even without an attack. Therefore, we set
T_w = 4 × s × R_n. Here, the factor 4 accounts for the
staggered recovery of 3f + 1 replicas f at a time, and s is
a safety factor to account for benign overload conditions
(i.e., no attack).
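
The arithmetic behind these settings can be checked with a short sketch. The function names are hypothetical, and the numeric inputs in the usage note are made-up values, not measurements from the paper.

```python
def stagger_groups(f):
    """Number of recovery rounds needed to recover 3f + 1 replicas
    f at a time: ceil((3f + 1) / f)."""
    return -(-(3 * f + 1) // f)

def watchdog_period(r_n, s, f=1):
    """T_w = 4 * s * R_n, where 4 is the number of staggered rounds."""
    return stagger_groups(f) * s * r_n

def window_of_vulnerability(t_k, t_r):
    """T_v = 2 * T_k + T_r."""
    return 2 * t_k + t_r
```

For any f ≥ 1 the quantity (3f + 1)/f lies in (3, 4], so four rounds are always needed, which is where the factor 4 comes from. As an example with illustrative inputs, R_n = 30 s and s = 2 give T_w = 240 s.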

Another issue is the bound f on the number of faults. 
Our replication technique is not useful if there is a strong 
positive correlation between the failure probabilities of 
the replicas; the probability of exceeding the bound may 
not be lower than the probability of a single fault in this 
case. Therefore, it is important to take steps to increase 
diversity. One possibility is to have diversity in the exe- 
cution environment: the replicas can be administered by 
different people; they can be in different geographic loca- 
tions; and they can have different configurations (e.g., run 
different combinations of services, or run schedulers with 
different parameters). This improves resilience to several 
types of faults, for example, attacks involving physical 
access to the replicas, administrator attacks or mistakes, 
attacks that exploit weaknesses in other services, and 
software bugs due to race conditions. Another possibil- 
ity is to have software diversity; replicas can run different 
operating systems and different implementations of the 
service code. There are several independent implemen- 
tations available for operating systems and important ser- 
vices (e.g. file systems, data bases, and WWW servers). 
This improves resilience to software bugs and attacks that 
exploit software bugs. 

Even without taking any steps to increase diversity, 
our proactive recovery technique increases resilience to 
nondeterministic software bugs, to software bugs due
to aging (e.g., memory leaks), and to attacks that take
more time than T_v to succeed. It is possible to improve
security further by exploiting software diversity across 
recoveries. One possibility is to restrict the service 
interface at a replica after its state is found to be corrupt. 
Another potential approach is to use obfuscation and 
randomization techniques [7, 9] to produce a new version 
of the software each time a replica is recovered. These 
techniques are not very resilient to attacks but they can 
be very effective when combined with proactive recovery 
because the attacker has a bounded time to break them. 


5 Implementation 


We implemented the algorithm as a library with a very 
simple interface (see Figure 4). Some components of the 
library run on clients and others at the replicas. 


Client:
int Byz_init_client(char *conf);
int Byz_invoke(Byz_req *req, Byz_rep *rep, bool read_only);

Server:
int Byz_init_replica(char *conf, char *mem, int size, UC exec);
void Byz_modify(char *mod, int size);

Server upcall:
int execute(Byz_req *req, Byz_rep *rep, int client);


Figure 4: The replication library API.


On the client side, the library provides a procedure
to initialize the client using a configuration file, which
contains the public keys and IP addresses of the replicas.
The library also provides a procedure, invoke, that is
called to cause an operation to be executed. This
procedure carries out the client side of the protocol and
returns the result when enough replicas have responded.

On the server side, we provide an initialization 
procedure that takes as arguments a configuration file 
with the public keys and IP addresses of replicas and 
clients, the region of memory where the application state 
is stored, and a procedure to execute requests. When 
our system needs to execute an operation, it makes an 
upcall to the execute procedure. This procedure carries 
out the operation as specified for the application, using 
the application state. As the application performs the 
operation, each time it is about to modify the application 
state, it calls the modify procedure to inform us of the
locations about to be modified. This call allows us to 
maintain checkpoints and compute digests efficiently as 
described in Section 4.6.2. 


6 Performance Evaluation 


This section has two parts. First, it presents results of 
experiments to evaluate the benefit of eliminating public- 
key cryptography from the critical path. Then, it presents 
an analysis of the cost of proactive recoveries. 


6.1 Experimental Setup 


All experiments ran with four replicas. Four replicas can 
tolerate one Byzantine fault; we expect this reliability 
level to suffice for most applications. Clients and 
replicas ran on Dell Precision 410 workstations with 
Linux 2.2.16-3 (uniprocessor). These workstations have 
a 600 MHz Pentium III processor, 512 MB of memory, 
and a Quantum Atlas 10K 18WLS disk. All machines
were connected by a 100 Mb/s switched Ethernet and 
had 3Com 3C905B interface cards. The switch was an 
Extreme Networks Summit48 V4.1. The experiments ran 
on an isolated network. 

The interval between checkpoints, K, was 128 re- 
quests, which causes garbage collection to occur several 
times in each experiment. The size of the log, L, was 
256. The state partition tree had 4 levels, each internal 
node had 256 children, and the leaves had 4 KB. 


6.2 The Cost of Public-Key Cryptography


To evaluate the benefit of using MACs instead of public
key signatures, we implemented BFT-PK. Our previous
algorithm [6] relies on the extra power of digital sig-
natures to authenticate pre-prepare, prepare, checkpoint,
and view-change messages but it can be easily modified
to use MACs to authenticate other messages. To provide
a fair comparison, BFT-PK is identical to the BFT library
but it uses public-key signatures to authenticate these four
types of messages. We ran a micro-benchmark and a file
system benchmark to compare the performance of ser-
vices implemented with the two libraries. There were no
view changes, recoveries or key changes in these experi-
ments.


6.2.1 


The micro-benchmark compares the performance of two 
implementations of a simple service: one implementation 
uses BFT-PK and the other uses BFT. This service has 
no state and its operations have arguments and results of 
different sizes but they do nothing. We also evaluated 
the performance of NO-REP: an implementation of 
the service using UDP with no replication. We ran 
experiments to evaluate the latency and throughput of 
the service. The comparison with NO-REP shows the 
worst case overhead for our library; in real services, the 
relative overhead will be lower due to computation or I/O 
at the clients and servers. 

Table 1 reports the latency to invoke an operation 
when the service is accessed by a single client. The results
were obtained by timing a large number of invocations 
in three separate runs. We report the average of the three 
runs. The standard deviations were always below 0.5% 
of the reported value. 


System  |   0/0 |   0/4 |   4/0
BFT-PK  | 59368 | 59761 | 59805
BFT     |   431 |   999 |  1046
NO-REP  |   106 |   625 |   630


Table 1: Micro-benchmark: operation latency in microseconds.
Each operation type is denoted by a/b, where a and b are the
sizes of the argument and result in KB.




BFT-PK has two signatures in the critical path and 
each of them takes 29.4 ms to compute. The algorithm 
described in this paper eliminates the need for these 
signatures. As a result, BFT is between 57 and 138 
times faster than BFT-PK. BFT’s latency is between 60% 
and 307% higher than NO-REP because of additional 
communication and computation overhead. For read- 
only requests, BFT uses the optimization described in [6] 
that reduces the slowdown for operations 0/0 and 0/4 to 
93% and 25%, respectively. 
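
The ratios quoted above can be rechecked directly from the Table 1 latencies. The sketch below only redoes that arithmetic; the helper names are illustrative.

```python
# Latencies in microseconds, copied from Table 1 (one value per operation type).
LATENCY_US = {
    "BFT-PK": [59368, 59761, 59805],
    "BFT": [431, 999, 1046],
    "NO-REP": [106, 625, 630],
}


def speedups(slow, fast):
    """How many times faster `fast` is than `slow`, per operation type."""
    return [s / f for s, f in zip(slow, fast)]


def overheads_percent(system, baseline):
    """Extra latency of `system` over `baseline`, in percent."""
    return [100.0 * (s - b) / b for s, b in zip(system, baseline)]


pk_vs_bft = speedups(LATENCY_US["BFT-PK"], LATENCY_US["BFT"])
bft_vs_norep = overheads_percent(LATENCY_US["BFT"], LATENCY_US["NO-REP"])

print("BFT speedup over BFT-PK:", [round(x) for x in pk_vs_bft])
print("BFT overhead over NO-REP (%):", [round(x) for x in bft_vs_norep])
```

The smallest and largest values reproduce the ranges in the text: 57 to 138 times faster than BFT-PK, and 60% to 307% slower than NO-REP.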

We also measured the overhead of replication at the
client. BFT increases CPU time relative to NO-REP by
up to a factor of 5, but the CPU time at the client is only
between 66 and 142 µs per operation. BFT also increases
the number of bytes in Ethernet packets that are sent or
received by the client: 405% for the 0/0 operation but
only 12% for the other operations.


4th Symposium on Operating Systems Design and Implementation


[Figure 5: three panels plotting 0/0, 0/4, and 4/0 operations per second against the number of clients for NO-REP, BFT, and BFT-PK.]

Figure 5: Micro-benchmark: throughput in operations per second.

Figure 5 compares the throughput of the different im- 
plementations of the service as a function of the number 
of clients. The client processes were evenly distributed 
over 5 client machines² and each client process invoked
operations synchronously, i.e., it waited for a reply before 
invoking a new operation. Each point in the graph is the 
average of at least three independent runs and the stan- 
dard deviation for all points was below 4% of the reported 
value (except that it was as high as 17% for the last four 
points in the graph for BFT-PK operation 4/0). There are 
no points with more than 15 clients for NO-REP opera- 
tion 4/0 because of lost request messages; NO-REP uses 
UDP directly and does not retransmit requests. 

The throughput of both replicated implementations 
increases with the number of concurrent clients because 
the library implements batching [4]. Batching inlines 
several requests in each pre-prepare message to amortize 
the protocol overhead. BFT-PK performs 5 to 11 times
worse than BFT because signing messages leads to a 
high protocol overhead and there is a limit on how many 
requests can be inlined in a pre-prepare message. 
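
The effect of batching can be illustrated with a toy cost model in which a fixed per-batch protocol cost is shared by all requests inlined in one pre-prepare message. This is a sketch under stated assumptions, not a model taken from the paper: charging the two 29.4 ms signatures once per batch and the per-request cost are both assumed for illustration.

```python
def throughput_ops_per_s(batch_size, per_batch_cost_s, per_request_cost_s):
    """Requests per second when a fixed per-batch cost is amortized."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return batch_size / (per_batch_cost_s + batch_size * per_request_cost_s)


# Assumed costs, chosen only to show the shape of the curve.
PER_BATCH_COST_S = 2 * 0.0294   # two signatures at 29.4 ms each (assumed per batch)
PER_REQUEST_COST_S = 0.0002     # hypothetical per-request work, not measured

for b in (1, 4, 16, 64):
    rate = throughput_ops_per_s(b, PER_BATCH_COST_S, PER_REQUEST_COST_S)
    print(f"batch of {b:2d}: about {rate:7.1f} ops/s")
```

In this model throughput grows with the batch size but is capped by the per-request cost, and a limit on the batch size leaves the expensive per-batch cost only partly amortized.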

The bottleneck in operation 0/0 is the server’s CPU;
BFT’s maximum throughput is 53% lower than NO-REP’s
due to extra messages and cryptographic operations
that increase the CPU load. The bottleneck in
operation 4/0 is the network; BFT’s throughput is within
11% of NO-REP’s because BFT does not consume
significantly more network bandwidth in this operation. BFT
achieves a maximum aggregate throughput of 26 MB/s
in operation 0/4 whereas NO-REP is limited by the link
bandwidth (approximately 12 MB/s). The throughput is
better in BFT because of an optimization that we
described in [6]: each client chooses one replica randomly;
this replica’s reply includes the 4 KB but the replies of
the other replicas only contain small digests. As a result,
clients obtain the large replies in parallel from different
replicas. We refer the reader to [4] for a detailed analysis
of these latency and throughput results.
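
The digest-reply optimization can be sketched as follows. This is an illustrative sketch, not the library's code: the function names, the use of SHA-256, and the reply layout are assumptions, and the rule for how many matching digests a client needs is left out because this section does not state it.

```python
import hashlib
import random


def digest(data: bytes) -> bytes:
    """Small fixed-size digest of a result (SHA-256 is an assumption)."""
    return hashlib.sha256(data).digest()


def replica_reply(result: bytes, send_full: bool) -> dict:
    """Only the designated replica includes the full result."""
    return {"digest": digest(result), "result": result if send_full else None}


def client_collect(results_by_replica: dict, designated: int):
    """Return the full result and the number of replies whose digest matches it."""
    replies = {
        rid: replica_reply(res, rid == designated)
        for rid, res in results_by_replica.items()
    }
    full = replies[designated]["result"]
    matching = sum(1 for r in replies.values() if r["digest"] == digest(full))
    return full, matching


results = {rid: b"x" * 4096 for rid in range(4)}   # four correct replicas
designated = random.randrange(4)                    # client picks one at random
full, matching = client_collect(results, designated)
print(len(full), "bytes received in full;", matching, "matching digests")
```

Because each client picks its own designated replica, different clients pull their 4 KB results from different replicas, which is how the aggregate throughput can exceed a single link's bandwidth.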


² Two client machines had 700 MHz PIIIs but were otherwise
identical to the other machines.


6.2.2 File System Benchmarks 


We implemented the Byzantine-fault-tolerant NFS ser- 
vice that was described in [6]. The next set of exper- 
iments compares the performance of two implementa- 
tions of this service: BFS, which uses BFT, and BFS-PK, 
which uses BFT-PK. 

The experiments ran the modified Andrew bench- 
mark [25, 15], which emulates a software development 
workload. It has five phases: (1) creates subdirectories 
recursively; (2) copies a source tree; (3) examines the 
status of all the files in the tree without examining their 
data; (4) examines every byte of data in all the files; and 
(5) compiles and links the files. Unfortunately, Andrew
is so small for today’s systems that it does not exercise
the NFS service. So we increased the size of the
benchmark by a factor of n as follows: phases 1 and 2 create
n copies of the source tree, and the other phases operate
in all these copies. We ran a version of Andrew with
n equal to 100, Andrew100, and another with n equal
to 500, Andrew500. BFS builds a file system inside a
memory mapped file [6]. We ran Andrew100 in a file
system file with 205 MB and Andrew500 in a file
system file with 1 GB; both benchmarks fill 90% of these
files. Andrew100 fits in memory at both the client and
the replicas but Andrew500 does not.

We also compare BFS and the NFS implementation in
Linux, NFS-std. The performance of NFS-std is a good 
metric of what is acceptable because it is used daily by 
many users. For all configurations, the actual benchmark 
code ran at the client workstation using the standard NFS 
client implementation in the Linux kernel with the same 
mount options. The most relevant of these options for 
the benchmark are: UDP transport, 4096-byte read and
write buffers, allowing write-back client caching, and 
allowing attribute caching. 

Tables 2 and 3 present the results for these experiments.
We report the mean of 3 runs of the benchmark. The
standard deviation was always below 1% of the reported
averages except for phase 1 where it was as high as 33%.
The results show that BFS-PK takes 12 times longer than
BFS to run Andrew100 and 15 times longer to run
Andrew500. The slowdown is smaller than the one
observed with the micro-benchmarks because the client
performs a significant amount of computation in this
benchmark.


[Table 2 columns: BFS-PK, BFS, NFS-std]

Table 2: Andrew100: elapsed time in seconds

Both BFS and BFS-PK use the read-only optimiza- 
tion described in [6] for reads and lookups, and as a 
consequence do not set the time-last-accessed attribute 
when these operations are invoked. This reduces the per- 
formance difference between BFS and BFS-PK during 
phases 3 and 4 where most operations are read-only. 





Table 3: Andrew500: elapsed time in seconds


BFS-PK is impractical but BFS’s performance is close 
to NFS-std: it performs only 15% slower in Andrew100 
and 24% slower in Andrew500. The performance dif- 
ference would be lower if Linux implemented NFS cor- 
rectly. For example, we reported previously [6] that BFS 
was only 3% slower than NFS in Digital Unix, which 
implements the correct semantics. The NFS implemen- 
tation in Linux does not ensure stability of modified data 
and meta-data as required by the NFS protocol, whereas 
BFS ensures stability through replication. 


6.3 The Cost of Recovery


Frequent proactive recoveries and key changes improve 
resilience to faults by reducing the window of vulnerabil- 
ity, but they also degrade performance. We ran Andrew 
to determine the minimum window of vulnerability that 
can be achieved without overlapping recoveries. Then we 
configured the replicated file system to achieve this win- 
dow, and measured the performance degradation relative 
to a system without recoveries. 


The implementation of the proactive recovery mech- 
anism is complete except that we are simulating the se- 
cure co-processor, the read-only memory, and the watch- 
dog timer in software. We are also simulating fast re- 
boots. The LinuxBIOS project [22] has been experiment- 
ing with replacing the BIOS by Linux. They claim to be 
able to reboot Linux in 35 s (0.1 s to get the kernel running
and 34.9 s to execute scripts in /etc/rc.d) [22]. This
means that in a suitably configured machine we should 
be able to reboot in less than a second. Replicas simulate
a reboot by sleeping either 1 or 30 seconds and calling
msync to invalidate the service-state pages (this forces 
reads from disk the next time they are accessed). 


6.3.1 Recovery Time 


The time to complete recovery determines the minimum 
window of vulnerability that can be achieved without 
overlaps. We measured the recovery time for Andrew100
and Andrew500 with 30s reboots and with the period
between key changes, T_k, set to 15s.

Table 4 presents a breakdown of the maximum time to 
recover a replica in both benchmarks. Since the processes 
of checking the state for correctness and fetching missing 
updates over the network to bring the recovering replica 
up to date are executed in parallel, Table 4 presents a 
single line for both of them. The line labeled restore 
state only accounts for reading the log from disk; the
service state pages are read from disk on demand when 
they are checked. 


                  | Andrew100 | Andrew500
save state        |           |
reboot            |           |
restore state     |           |
estimation        |           |
send new-key      |           |
send request      |           |
fetch and check   |           |
total             |     42.59 |    143.68


Table 4: Andrew: recovery time in seconds.


The most significant components of the recovery time 
are the time to save the replica’s log and service state 
to disk, the time to reboot, and the time to check and 
fetch state. The other components are insignificant. The 
time to reboot is the dominant component for Andrew100
and checking and fetching state account for most of the 
recovery time in Andrew500 because the state is bigger. 

Given these times, we set the period between watchdog
timeouts, T_w, to 3.5 minutes in Andrew100 and to 10
minutes in Andrew500. These settings correspond to a
minimum window of vulnerability of 4 and 10.5 minutes,
respectively. We also ran the experiments for Andrew100
with a 1s reboot and the maximum time to complete
recovery in this case was 13.3s. This enables a window of
vulnerability of 1.5 minutes with T_w set to 1 minute.
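
The three windows reported above are all consistent with adding two key-change periods to the watchdog period. That relation is inferred here from the numbers in this section, not quoted from it, so the sketch below should be read as a consistency check under that assumption, with the key-change period taken to be 15s in every configuration.

```python
def window_of_vulnerability_s(watchdog_period_s, key_change_period_s):
    """Assumed relation: window = watchdog period + 2 * key-change period."""
    return watchdog_period_s + 2 * key_change_period_s


KEY_CHANGE_PERIOD_S = 15  # period between key changes used in the experiments

CONFIGS = [
    ("Andrew100, 30s reboot", 3.5 * 60),
    ("Andrew500, 30s reboot", 10 * 60),
    ("Andrew100, 1s reboot", 1 * 60),
]

for label, watchdog_period_s in CONFIGS:
    window_s = window_of_vulnerability_s(watchdog_period_s, KEY_CHANGE_PERIOD_S)
    print(f"{label}: {window_s / 60:.1f} minutes")
```

Under this assumption the script prints 4.0, 10.5, and 1.5 minutes, matching the values in the text.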

Recovery must be fast to achieve a small window of 
vulnerability. While the current recovery times are low, it 
is possible to reduce them further. For example, the time 
to check the state can be reduced by periodically backing 
up the state onto a disk that is normally write-protected 
and by using copy-on-write to create copies of modified 
pages on a writable disk. This way only the modified 
pages need to be checked. If the read-only copy of the 
state is brought up to date frequently (e.g., daily), it will 
be possible to scale to very large states while achieving 
even lower recovery times. 




6.3.2 Recovery Overhead 


We also evaluated the impact of recovery on performance 
in the experimental setup described in the previous sec- 
tion. Table 5 shows the results. BFS-rec is BFS with 
proactive recoveries. The results show that adding fre- 
quent proactive recoveries to BFS has a low impact on 
performance: BFS-rec is 16% slower than BFS in
Andrew100 and 2% slower in Andrew500. In Andrew100
with 1s reboot and a window of vulnerability of 1.5 min- 
utes, the time to complete the benchmark was 482.4s; this 
is only 27% slower than the time without recoveries even 
though every 15s one replica starts a recovery. 

The results also show that the period between key
changes, T_k, can be small without impacting performance
significantly. T_k could be smaller than 15s but it should be
substantially larger than 3 message delays under normal
load conditions to provide liveness.


System   | Andrew100 | Andrew500
BFS-rec  |     443.5 |    2257.8
BFS      |     381.3 |    2202.9
NFS-std  |     332.0 |    1781.6


Table 5: Andrew: recovery overhead in seconds.
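
The slowdowns quoted in this section follow from the elapsed times in Table 5. The sketch below only redoes that arithmetic; the dictionary layout and function name are illustrative.

```python
# Elapsed times in seconds, copied from Table 5.
ELAPSED_S = {
    "BFS-rec": {"Andrew100": 443.5, "Andrew500": 2257.8},
    "BFS": {"Andrew100": 381.3, "Andrew500": 2202.9},
    "NFS-std": {"Andrew100": 332.0, "Andrew500": 1781.6},
}


def slowdown_percent(slow_s: float, fast_s: float) -> float:
    """How much longer slow_s is than fast_s, in percent."""
    return 100.0 * (slow_s - fast_s) / fast_s


for bench in ("Andrew100", "Andrew500"):
    rec = slowdown_percent(ELAPSED_S["BFS-rec"][bench], ELAPSED_S["BFS"][bench])
    bfs = slowdown_percent(ELAPSED_S["BFS"][bench], ELAPSED_S["NFS-std"][bench])
    print(f"{bench}: BFS-rec vs BFS {rec:.0f}%, BFS vs NFS-std {bfs:.0f}%")

# Andrew100 with a 1s reboot took 482.4s (reported in the text, not in Table 5).
fast_reboot = slowdown_percent(482.4, ELAPSED_S["BFS"]["Andrew100"])
print(f"Andrew100, 1s reboot vs BFS: {fast_reboot:.0f}%")
```

The results round to 16% and 2% for recovery overhead, 15% and 24% for BFS against NFS-std, and 27% for the 1s-reboot configuration.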


There are several reasons why recoveries have a 
low impact on performance. The most obvious is that 
recoveries are staggered such that there is never more 
than one replica recovering; this allows the remaining 
replicas to continue processing client requests. But it is 
necessary to perform a view change whenever recovery 
is applied to the current primary and the clients cannot 
obtain further service until the view change completes. 
These view changes are inexpensive because a primary 
multicasts a view-change message just before its recovery 
starts and this causes the other replicas to move to the next
view immediately. 
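
The staggering argument can be checked against the measured recovery times. The sketch assumes that the four replicas start recovery at evenly spaced offsets of T_w / 4; this spacing is an assumption, although it agrees with the statement that one replica starts a recovery every 15s when T_w is 1 minute.

```python
def recovery_spacing_s(watchdog_period_s: float, n_replicas: int) -> float:
    """Time between consecutive recovery starts under even staggering (assumed)."""
    return watchdog_period_s / n_replicas


def recoveries_overlap(watchdog_period_s, n_replicas, max_recovery_s) -> bool:
    """True if one recovery could still be running when the next one starts."""
    return max_recovery_s > recovery_spacing_s(watchdog_period_s, n_replicas)


# (label, watchdog period in seconds, maximum recovery time in seconds)
CASES = [
    ("Andrew100, 30s reboot", 3.5 * 60, 42.59),
    ("Andrew500, 30s reboot", 10 * 60, 143.68),
    ("Andrew100, 1s reboot", 1 * 60, 13.3),
]

for label, t_w, t_rec in CASES:
    spacing = recovery_spacing_s(t_w, 4)
    print(f"{label}: spacing {spacing:.1f}s, recovery {t_rec}s, "
          f"overlap={recoveries_overlap(t_w, 4, t_rec)}")
```

In all three configurations the maximum recovery time is shorter than the spacing, so at most one replica is recovering at any time.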


7 Related Work 


Most previous work on replication techniques assumed 
benign faults, e.g., [17, 23, 18, 19] or a synchronous
system model, e.g., [28]. Earlier Byzantine-fault-tolerant
systems [26, 16, 20], including the algorithm we de- 
scribed in [6], could guarantee safety only if fewer than 
1/3 of the replicas were faulty during the lifetime of the 
system. This guarantee is too weak for long-lived sys- 
tems. Our system improves this guarantee by recovering 
replicas proactively and frequently; it can tolerate any 
number of faults if fewer than 1/3 of the replicas be- 
come faulty within a window of vulnerability, which can 
be made small under normal load conditions with low 
impact on performance. 

In a previous paper [6], we described a system that 
tolerated Byzantine faults in asynchronous systems and 
performed well. This paper extends that work by 
providing recovery, a state transfer mechanism, and a new 
view change mechanism that enables both recovery and 
an important optimization — the use of MACs instead of 
public-key cryptography.

Rampart [26] and SecureRing [16] provide group
membership protocols that can be used to implement 
recovery, but only in the presence of benign faults. These 
approaches cannot be guaranteed to work in the presence 
of Byzantine faults for two reasons. First, the system may 
be unable to provide safety if a replica that is not faulty 
is removed from the group to be recovered. Second, the 
algorithms rely on messages signed by replicas even after 
they are removed from the group and there is no way to 
prevent attackers from impersonating removed replicas 
that they controlled. 

The problem of efficient state transfer has not been 
addressed by previous work on Byzantine-fault-tolerant 
replication. We present an efficient state transfer mecha- 
nism that enables frequent proactive recoveries with low 
performance degradation. 

Public-key cryptography was the major performance 
bottleneck in previous systems [26, 16] despite the fact 
that these systems include sophisticated techniques to 
reduce the cost of public-key cryptography at the expense 
of security or latency. They cannot use MACs instead 
of signatures because they rely on the extra power of 
digital signatures to work correctly: signatures allow the 
receiver of a message to prove to others that the message 
is authentic, whereas this may be impossible with MACs. 
The view change mechanism described in this paper does 
not require signatures. It allows public-key cryptography 
to be eliminated, except for obtaining new secret keys. 
This approach improves performance by up to two orders 
of magnitude without losing security.
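
The difference between MACs and signatures noted above can be illustrated with Python's standard hmac module. The keys and message contents are hypothetical, and HMAC-SHA-256 is an assumption rather than the MAC used by the system.

```python
import hashlib
import hmac


def make_mac(shared_key: bytes, message: bytes) -> bytes:
    """Compute a MAC that only holders of shared_key can produce or check."""
    return hmac.new(shared_key, message, hashlib.sha256).digest()


def check_mac(shared_key: bytes, message: bytes, tag: bytes) -> bool:
    """Constant-time verification of a MAC."""
    return hmac.compare_digest(make_mac(shared_key, message), tag)


key_ab = b"key shared by replicas A and B"   # hypothetical session key
msg = b"prepare view=0 seq=7"                # hypothetical message body
tag = make_mac(key_ab, msg)                  # A authenticates msg to B

# B can verify that the message came from a holder of key_ab.
print("B accepts:", check_mac(key_ab, msg, tag))

# B could have produced an identical tag itself, so the tag proves nothing to others.
print("B can forge:", make_mac(key_ab, msg) == tag)

# A third replica C does not hold key_ab and cannot check the tag at all.
print("C accepts:", check_mac(b"key known only to C", msg, tag))
```

A digital signature, by contrast, can be checked by anyone holding the signer's public key, which is the extra power the earlier protocols depended on.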

The concept of a system that can tolerate more than 
f faults provided no more than f nodes in the system 
become faulty in some time window was introduced 
in [24]. This concept has previously been applied 
in synchronous systems to secret-sharing schemes [13], 
threshold cryptography [14], and more recently secure 
information storage and retrieval [10] (which provides 
single-writer single-reader replicated variables). But our 
algorithm is more general; it allows a group of nodes in 
an asynchronous system to implement an arbitrary state 
machine. 


8 Conclusions 


This paper has described a new state-machine replication 
system that offers both integrity and high availability in 
the presence of Byzantine faults. The new system can 
be used to implement real services because it performs 
well, works in asynchronous systems like the Internet, 
and recovers replicas to enable long-lived services. 


The system described here improves the security and 
robustness against software errors of previous systems 
by recovering replicas proactively and frequently. It 
can tolerate any number of faults provided fewer than 
1/3 of the replicas become faulty within a window 
of vulnerability. This window can be small (e.g., a 
few minutes) under normal load conditions and when 
the attacker does not corrupt replicas’ copies of the 
service state. Additionally, our system provides intrusion
detection; it detects denial-of-service attacks aimed at
increasing the window and detects the corruption of the 
state of a recovering replica. 

Recovery from Byzantine faults is harder than recov- 
ery from benign faults for several reasons: the recovery 
protocol itself needs to tolerate other Byzantine-faulty 
replicas; replicas must be recovered proactively; and at- 
tackers must be prevented from impersonating recovered 
replicas that they controlled. For example, the last re- 
quirement prevents signatures in messages from being 
valid indefinitely. However, this leads to a further prob- 
lem, since replicas may be unable to prove to a third party 
that some message they received is authentic (because its 
signature is no longer valid). All previous state-machine 
replication algorithms relied on such proofs. Our algo- 
rithm does not rely on these proofs and has the added 
advantage of enabling the use of symmetric cryptogra- 
phy for authentication of all protocol messages. This 
eliminates the use of public-key cryptography, the major 
performance bottleneck in previous systems. 


The algorithm has been implemented as a generic 
program library with a simple interface that can be used 
to provide Byzantine-fault-tolerant versions of different 
services. We used the library to implement BFS, a 
replicated NFS service, and ran experiments to determine 
the performance impact of our techniques by comparing 
BFS with an unreplicated NFS. The experiments show
that it is possible to use our algorithm to implement real 
services with performance close to that of an unreplicated 
service. Furthermore, they show that the window of 
vulnerability can be made very small: 1.5 to 10 minutes 
with only 2% to 27% degradation in performance.


Abstract: We explore the abstraction of failure transparency
in which the operating system provides the illusion of fail- 
ure-free operation. To provide failure transparency, an oper- 
ating system must recover applications after hardware, 
operating system, and application failures, and must do so 
without help from the programmer or unduly slowing fail- 
ure-free performance. We describe two invariants that must 
be upheld to provide failure transparency: one that ensures 
sufficient application state is saved to guarantee the user can- 
not discern failures, and another that ensures sufficient appli- 
cation state is lost to allow recovery from failures affecting 
application state. We find that several real applications get 
failure transparency in the presence of simple stop failures 
with overhead of 0-12%. Less encouragingly, we find that 
applications violate one invariant in the course of upholding 
the other for more than 90% of application faults and 3-15% 
of operating system faults, rendering transparent recovery 
impossible for these cases. 


1. Introduction 

One of the most important jobs of the operating system 
is to conceal the complexities and inadequacies of the
underlying machine. Towards this end, modern operating systems
provide a variety of abstractions. To conceal machines’ lim- 
ited memory, for example, operating systems provide the 
abstraction of practically boundless virtual memory. Simi- 
larly, operating systems give the abstraction of multithread- 
ing for those applications that might benefit from more 
processors than are present in hardware. 

Failures by computer system components, be they 
hardware, software, or the application, are a shortcoming of 
modern systems that has not been abstracted away. Instead, 
computer programmers and users routinely have to deal with 
the effects of failures, even on machines running state-of- 
the-art operating systems. 

With this paper we explore the abstraction of failure 
transparency in which the operating system generates the 
illusion of failure-free operation. To provide this illusion, the 
operating system must handle all hardware, software, and 
application failures to keep them from affecting what the 
user sees. Furthermore, the operating system must do so 
without help from the programmer and without unduly
slowing down failure-free operation.

Fault-tolerance research has established many of the
components of failure transparency, such as programmer- 
transparent recovery [4, 11, 25, 28], and recovery for general 
applications [4, 14]. Some researchers have even discussed 
handling application failures [13, 17, 31]. 

However, significant questions surrounding failure 
transparency remain. The focus of this paper is on delving 
into several of these unanswered questions. First, we will 
explore the question “how does one guarantee failure trans- 
parency in general?” The answer to this question comes in 
the form of two invariants. The first invariant is a reformula- 
tion of existing recovery theory, governing when an applica- 
tion must save its work to ensure that the user does not 
discern failures. In contrast, the second invariant governs 
how much work an application must lose to avoid forcing the 
same failure during recovery. 

The Save-work invariant can require applications to 
commit their state frequently to stable storage. The question 
therefore arises “how expensive is it for general applications 
to uphold the Save-work invariant?” In answering this ques- 
tion we find, to our surprise, that even complex, general 
applications are able to efficiently uphold Save-work. 

Given that the Save-work invariant forces applications 
to preserve work and the Lose-work invariant forces applica- 
tions to throw work away, we conclude by investigating the 
question, “how often do these invariants conflict, making 
failure transparency impossible?” The unfortunate answer is 
that the invariants conflict all too often. 


2. Guaranteeing Failure Transparency 


We first delve into the question: how does one guaran- 
tee failure transparency in general? Our exploration begins 
with a synthesis of existing recovery theory that culminates 
in the Save-work invariant. In Section 2.4, we then extend 
recovery theory to point out a parameterization of the space 
of recovery protocols, as well as the relationship between 
protocols at different points in the space. Finally, we develop 
a new theory and second invariant for ensuring the possibil- 
ity of recovery from failures that affect application state. 


2.1. Primitives for general recovery 


In attempting to provide failure transparency, the goal 
is to recover applications using only general techniques that
require no help from the application. There are several
recovery primitives available to us in this domain: commit 
events, rollback of a process, and reexecution from a prior 
state. 

A process can execute commit events to aid recovery 
after a failure. By executing a commit event, a process pre- 
serves its state at the time of the commit so that it can later 
restore that state and continue execution. Although how 
commit events are implemented is not important to our dis- 
cussion, executing a commit event might involve writing out 
a full-process checkpoint to stable storage, ending a transac- 
tion, or sending a state-update message to a backup process. 

When a failure occurs, the application undergoes roll- 
back of its failed processes; each failed process is returned to 
its last committed state. From that state, the recovering pro- 
cess begins reexecution, possibly recomputing work lost in 
the failure. 
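
The three primitives can be sketched with a toy process whose state is an in-memory dictionary. This is a minimal illustration, assuming a commit is a deep copy of the state; the class and method names are hypothetical, and a real commit would go to stable storage or a backup process as described above.

```python
import copy


class RecoverableProcess:
    """Toy process supporting commit, rollback, and reexecution."""

    def __init__(self, initial_state):
        self.state = copy.deepcopy(initial_state)
        self._committed = copy.deepcopy(initial_state)

    def execute(self, event):
        """Apply an event (a function that mutates the state)."""
        event(self.state)

    def commit(self):
        """Preserve the current state so it can be restored later."""
        self._committed = copy.deepcopy(self.state)

    def rollback(self):
        """Return to the last committed state after a failure."""
        self.state = copy.deepcopy(self._committed)


def increment(state):
    state["count"] += 1


proc = RecoverableProcess({"count": 0})
proc.execute(increment)
proc.execute(increment)
proc.commit()              # count == 2 is now preserved
proc.execute(increment)    # uncommitted work: count == 3
proc.rollback()            # failure: back to count == 2
proc.execute(increment)    # reexecution recomputes the lost work
print(proc.state["count"])
```

The work done after the last commit is lost at rollback and recomputed during reexecution, so the process ends in the state it would have reached without the failure.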

Providing generic recovery requires that applications 
tolerate forced rollback and reexecution. As a result, all 
application operations must be either undoable or redoable. 

Most application operations that simply modify process 
state are easily undone. However, some events, such as mes- 
sage sends, are hard to undo. Undoing a send can involve the 
added challenge of rolling back the recipient’s state. Other 
events can be impossible to undo. For example, we cannot 
undo the effects on the user resulting from visible output. 
However, systems providing failure transparency ensure that 
these user-visible events will never be undone. 

Similarly, since simple state changes by the application 
are idempotent, most application events can be safely 
redone. However, events like message sends and receives are 
more difficult to redo. For message send events to be redo- 
able, the application must either tolerate or filter duplicate 
messages. For receive events to be redoable, messages must 
be saved at either the sender or receiver so they can be re- 
delivered after a failure. Luckily, these reexecution require- 
ments are very similar to the demands made of systems that 
transmit messages on unreliable channels (e.g. UDP). Such 
systems must already work correctly even with lost or dupli- 
cated messages. For many recovery systems, an application 
or protocol layer’s natural filtering and retransmission mech- 
anisms will be enough to support the needs of reexecution 
recovery. For others, messages may have to be held in a 
recovery buffer of some kind so they can be re-delivered 
should a receive event be redone. 
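
The two requirements above, filtering duplicate sends and saving messages for re-delivery, can be sketched together. This is an illustrative sketch only: the class names, the use of per-sender sequence numbers, and the sender-side buffer are assumptions rather than a description of any particular recovery system.

```python
class Receiver:
    """Delivers each (sender, sequence number) pair at most once."""

    def __init__(self):
        self.delivered = []
        self._seen = set()

    def receive(self, sender: str, seq: int, payload: str) -> bool:
        if (sender, seq) in self._seen:
            return False          # duplicate from a redone send: filtered
        self._seen.add((sender, seq))
        self.delivered.append(payload)
        return True


class Sender:
    """Keeps sent messages in a recovery buffer so receives can be redone."""

    def __init__(self, name: str):
        self.name = name
        self._next_seq = 0
        self._buffer = {}

    def send(self, receiver: Receiver, payload: str) -> int:
        seq = self._next_seq
        self._next_seq += 1
        self._buffer[seq] = payload
        receiver.receive(self.name, seq, payload)
        return seq

    def resend_all(self, receiver: Receiver) -> None:
        for seq, payload in sorted(self._buffer.items()):
            receiver.receive(self.name, seq, payload)


sender = Sender("p0")
receiver = Receiver()
sender.send(receiver, "a")
sender.send(receiver, "b")
sender.resend_all(receiver)        # redone sends are filtered as duplicates
print(receiver.delivered)

restarted = Receiver()             # receiver rolled back and lost its messages
sender.resend_all(restarted)       # buffered messages are re-delivered
print(restarted.delivered)
```

Both lists contain each message exactly once, whether the sends were redone against a live receiver or re-delivered to one that had rolled back.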


2.2. Computation and failure model 


We will informally present a recovery theory that will 
let us relate the challenge of guaranteeing failure transpar- 
ency to the precise events executed by an application. For a 
more formal version of the theory, please see [22]. 

We begin by constructing a model of computing. One 
or more processes working together on a task is called a 
computation. We model each process as a finite state
machine. That is, each process has state and computes by
transitioning from state to state according to the inputs it
receives. Each state transition executed by a process is an
event. An event e_p^i is the i’th event executed by process p.
Events can correspond in real programs to simple changes of
application state, sending and receiving messages, and so on.
We call events that have an effect on the user visible events
(these events have traditionally been called “output events”
[11]). Under our model, computation proceeds asynchronously,
that is, without known bounds on message delivery
time or the relative speeds of processes.

As needed, we will order events in our asynchronous 
computations with Lamport’s happens-before relation [19]. 
We may also need to discuss the causal relationship between 
events. For example, we may need to ask, “did event e in 
some way cause event e'?” We will use happens-before as 
an approximation of causality. We will however distinguish 
between happens-before’s use as an ordering constraint and 
its use as an approximation of causality by using the expres- 
sion causally precedes in this latter role. That is, we say 
event e causally precedes event e' if and only if e happens- 
before e' and we intend to convey that e causes event e'. 
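
One common way to test the happens-before relation in an implementation is with vector clocks. The sketch below uses that standard technique as an illustration; it is not something the paper itself describes, and the class and function names are hypothetical.

```python
def happens_before(vc_a, vc_b) -> bool:
    """True if the event stamped vc_a happens-before the event stamped vc_b."""
    return all(a <= b for a, b in zip(vc_a, vc_b)) and tuple(vc_a) != tuple(vc_b)


class Process:
    """Process that timestamps its events with a vector clock."""

    def __init__(self, pid: int, n_processes: int):
        self.pid = pid
        self.clock = [0] * n_processes

    def local_event(self):
        self.clock[self.pid] += 1
        return tuple(self.clock)

    def send(self):
        """A send is an event; its timestamp travels with the message."""
        return self.local_event()

    def receive(self, msg_clock):
        """Merge the sender's knowledge, then count the receive event."""
        self.clock = [max(a, b) for a, b in zip(self.clock, msg_clock)]
        self.clock[self.pid] += 1
        return tuple(self.clock)


p0, p1 = Process(0, 2), Process(1, 2)
e1 = p0.local_event()     # event on p0
m = p0.send()             # p0 sends a message
e2 = p1.local_event()     # independent event on p1
e3 = p1.receive(m)        # p1 receives the message

print(happens_before(e1, e3))   # e1 precedes the receive through the message
print(happens_before(e1, e2))   # e1 and e2 are concurrent
print(happens_before(e2, e1))
```

Here only the first comparison is true: e1 is ordered before e3 by the message, while e1 and e2 are unordered. As the text notes, such an ordering is only an approximation of causality.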

We will consider failures of two forms. A stop failure is 
one in which execution of one or more processes in the com- 
putation simply ceases. Stop failures do occur in real sys- 
tems—the loss of power, the frying of a processor, or the 
abrupt halting of the operating system all appear to the 
recovery system as stop failures. Since stop failures
instantaneously stop the execution of the application and do
not corrupt application state, recovering from them is
relatively easy.

Harder to handle are propagation failures. We define a 
propagation failure to be one in which a bug somewhere in 
the system causes the application to enter a state it would not 
enter in a failure-free execution. A propagation failure can 
begin with a bug in hardware, the operating system, or the 
application. Bugs in the application are always propagation 
failures, but bugs in hardware and the operating system are 
propagation failures only once they affect application state. 

Recovering from propagation failures is hard because a 
process can execute for some time after the failure is trig- 
gered. During that time the process can propagate buggy data 
into larger portions of its state, to other processes, or onto 
stable storage. 

We can imagine bugs that do not cause crashes, but that 
simply cause incorrect visible output by the application. 
However, our focus with this work is on recovering from 
failures. Therefore, we will assume that applications will 
detect faults and fail before generating incorrect output. 


2.3. Failure transparency for stop failures 


We start by examining how to ensure failure transparency
in the presence of stop failures. We must first fix a precise
notion of “correct” recovery from failures. We could
establish almost any standard: recovering the exact
pre-failure state, losing less than 10 seconds of work, and so on.
However, given that our end goal is to mask failures from the 
user, we will define correct recovery in terms of the applica- 
tion output seen by the user. 

Given a computation in which processes have failed, 
recovered, and continued execution: 


Definition: Consistent Recovery 


Recovery is consistent if and only if there exists a com- 
plete, failure-free execution of the computation that 
would result in a sequence of visible events equivalent 
to the sequence of visible events actually output in the 
failed and recovered run. 


Thus for an application’s recovery to be consistent, the 
sum total of the application’s visible output before and after 
a failure must be equivalent to the output from some failure- 
free execution of the application. 

It is possible that many different modes of consistent 
recovery could be allowed depending on how one defines 
“equivalent”. For our purposes, we will call a sequence of 
visible events V output by a recovered computation equiva- 
lent to sequence V' output by a failure-free run if the only 
events in V that differ from V' are repeats of earlier events 
from V. 
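
This notion of equivalence can be written as a small check. The sketch below is one possible reading of the definition, assuming visible events are hashable values and that the failure-free sequence V' is complete; the function name is illustrative.

```python
def is_equivalent(recovered, failure_free) -> bool:
    """True if `recovered` equals `failure_free` apart from repeats of earlier events.

    Every event of `recovered` must either be the next unmatched event of
    `failure_free` or a repeat of an event already seen in `recovered`.
    """
    seen = set()
    next_index = 0
    for event in recovered:
        if next_index < len(failure_free) and event == failure_free[next_index]:
            next_index += 1           # matches the failure-free sequence
        elif event not in seen:
            return False              # a new event that V' never produced here
        seen.add(event)
    # The recovered run must also produce all of V', not just a prefix.
    return next_index == len(failure_free)


print(is_equivalent(["a", "b", "b", "c"], ["a", "b", "c"]))  # duplicate of b
print(is_equivalent(["a", "b", "a", "c"], ["a", "b", "c"]))  # repeat of earlier a
print(is_equivalent(["a", "c"], ["a", "b", "c"]))            # b is missing
print(is_equivalent(["a", "b"], ["a", "b", "c"]))            # run is incomplete
```

The first two calls return True because the extra events repeat earlier output; the last two return False because the recovered run skips or never reaches part of the failure-free output.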

We use equivalence in which duplicate visible events
are allowed because guaranteeing no duplication is very hard
(the exactly-once delivery problem). Furthermore, allowing
duplicates provides some flexibility in how one attains con- 
sistent recovery. More importantly, users can probably over- 
look duplicated visible events. See [22] for a more detailed 
discussion of equivalence. 

Our definition of consistent recovery places two con- 
straints on recovering applications. First, computations must 
always execute visible events that extend a legal, failure-free 
sequence of visible events, even in the presence of failures. 
We will call this the visible constraint. Second, computations 
must always be able to execute to completion. This latter 
constraint follows from the fact that consistent recovery is 
defined in terms of complete sequences of visible events. If a 
failure prevents an application from running to completion, 
its sequence can never be complete. For reasons that will 
become clear later, we will call this second constraint on 
recovery the no-orphan constraint. 

Although consistent recovery and failure transparency 
are closely related, they are not the same thing. Providing
failure transparency amounts to guaranteeing consistent 
recovery without any help from the application, and without 
slowing the application’s execution appreciably. 

Our next task is to examine how to guarantee applica- 
tions get consistent recovery. One particular class of events 
poses the greatest challenge: non-deterministic events. In a 
state-machine, a non-deterministic event is a transition from 
a state that has multiple possible next states. For example, in
Figure 1, events $e^1$ and $e^2$ are both non-deterministic. In
real systems, non-deterministic events correspond to actions
that can have different results before and after a failure, like
checking the time-of-day clock, taking a signal, reading user
input, or receiving a message.


Figure 1: Coin-flip application. Depending on whether non-
deterministic event $e^1$ or $e^2$ gets executed, the application
executes one of two possible visible events, “heads” or “tails”.

Non-deterministic events are intimately related to con- 
sistent recovery. To see how, again consider the application 
shown in Figure 1. Imagine that the application executes 
non-deterministic event $e^1$, then the visible event “heads”,
then fails. Then during recovery imagine that the application
rolls back and this time executes $e^2$ followed by the visible
event “tails”. Although this application can correctly output 
either heads or tails, in no correct execution does it output 
both heads and tails. Therefore, recovery in this example is 
not consistent and our sample application’s non-determinis- 
tic events are the culprits. 

As discussed in Section 2.1, applications can execute 
commit events to aid later rollback. We would like to use 
commit events to guarantee consistent recovery, avoiding the 
inconsistency non-deterministic events can cause. The fol- 
lowing theorem provides the necessary and sufficient condi- 
tion for doing exactly that under stop failures. 


Save-work Theorem

A computation is guaranteed consistent recovery from
stop failures if and only if for each executed non-deter-
ministic event $e_p^i$ that causally precedes a visible or
commit event $e$, process $p$ executes a commit event $e_p^j$
such that $e_p^j$ happens-before (or is atomic with) $e$, and
$i < j$.

This theorem dictates when processes must commit in 
order to ensure consistent recovery. At the heart of this theo- 
rem is the Save-work invariant, which informally states 
“each process has to commit all its non-deterministic events 
that causally precede visible or commit events”. We can fur- 
ther divide this invariant into separate rules, one that 
enforces the visible constraint of consistent recovery, and 
one that enforces the no-orphan constraint. If we follow the 
rule “commit every non-deterministic event that causally
precedes a visible event”, we are assured that the applica- 
tion’s visible output will always extend a legal sequence of 
visible events. We'll call this the Save-work-visible invari- 
ant. If we follow the rule “commit every non-deterministic 
event that causally precedes a commit event”, we are assured 
that a finite number of stop failures cannot prevent the appli- 
cation from executing to completion. We’ll call this the 
Save-work-orphan invariant. To better understand this latter
rule, consider the computation depicted in Figure 2.


Figure 2: A problematic distributed computation. We see two
processes’ timelines. The arrow between the processes represents a
message from B to A. Black boxes represent commits. The event
marked “ND” is a non-deterministic event. Process A is an orphan
after Process B’s failure as A has committed a dependence on B’s
lost non-deterministic event.

A process is called an orphan if it has committed a 
dependence on another process’s non-deterministic event 
that has been lost and may not be reexecuted. For example, 
Process A in Figure 2 is an orphan because it has committed 
its dependence on Process B’s lost non-deterministic event. 

An orphan can prevent an application from executing to
completion when it is upholding Save-work-visible. Con-
sider an orphan that has committed a dependence on a lost
non-deterministic event $e_p^i$. If the orphan attempts to exe-
cute a visible event $e$, Save-work-visible requires that pro-
cess $p$ commit $e_p^i$. However, since process $p$ has already
failed and aborted $e_p^i$, it cannot commit it. Furthermore,
since the orphan cannot abort its dependence on $e_p^i$, it can
never execute $e$ and the computation will not be able to com-
plete.

The remedy for this scenario is to uphold Save-work- 
orphan, which ensures that any non-deterministic event that 
causally precedes a commit is committed. 

We must make two assumptions for the Save-work 
Theorem to be necessary. We ensure the necessity of Save- 
work-visible by assuming that all non-deterministic events 
can cause inconsistency. We ensure the necessity of Save- 
work-orphan by assuming that all processes in the computa- 
tion affect the computation’s visible output. For the details of 
these assumptions as well as the proof of the Save-work The- 
orem, please see [22]. 
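
To make the Save-work-visible rule concrete, the following sketch checks it over the event trace of a single process, where every earlier event causally precedes every later one. The function name and the event labels ("nd", "commit", "visible", "other") are hypothetical, chosen only for this illustration; the multi-process case would additionally need causality tracking across messages.

```python
def upholds_save_work_visible(trace):
    """Check a single-process trace (list of event kinds in execution
    order) against the rule: every non-deterministic event that precedes
    a visible event must be committed before that visible event."""
    uncommitted_nd = False
    for kind in trace:
        if kind == "nd":
            uncommitted_nd = True      # non-determinism not yet saved
        elif kind == "commit":
            uncommitted_nd = False     # a commit covers all earlier events
        elif kind == "visible" and uncommitted_nd:
            return False               # output depends on unsaved non-determinism
    return True
```

A trace such as `["nd", "commit", "visible"]` passes, while `["nd", "visible"]` does not.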


2.4. Upholding Save-work 


There are many ways an application can uphold the 
Save-work invariant to ensure consistent recovery for stop 
failures. For example, an application can execute a commit 
event for every event executed by the application. Although 
such a protocol will cause a very large number of commits, it 
has the advantage of being trivial to implement: the protocol 
does not need to figure out which events are non-determinis-
tic, or which events are visible. Even without knowing event
types, it correctly upholds the Save-work invariant. 

Consider a protocol in which each process executes a 
commit event immediately after each non-deterministic 
event. In committing all non-deterministic events, this proto-
col will certainly commit those non-deterministic events that
causally precede visible or commit events. Therefore it 
upholds Save-work and will guarantee consistent recovery. 
We call this protocol Commit After Non-Deterministic, or 
CAND. 

We can also uphold Save-work without knowing about 
the non-determinism in the computation. Under the Commit 
Prior to Visible or Send protocol (CPVS), each process com- 
mits just before doing a visible event or a send to another 
process. When a process commits before each of its visible 
events, it is assured that all its non-determinism that causally 
precedes the visible event is committed. If each process also 
commits before every send event, then it cannot pass a 
dependence on an uncommitted non-deterministic event to 
another process. Thus, CPVS also upholds Save-work. 

The Commit Between Non-Deterministic and Visible 
or Send (CBNDVS) protocol takes advantage of knowledge 
of both non-determinism and visible and send events in order 
to uphold Save-work. Under this protocol, each process 
commits immediately before a visible or send event if the 
process has executed a non-deterministic event since its last 
commit. 
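
The three protocols can be compared by counting the commits each would issue for the same sequence of events. The sketch below is an illustration under simplifying assumptions: a single process, a trace given as a list of hypothetical event labels ("nd", "visible", "send", "other"), and a helper name `count_commits` that does not come from the paper.

```python
def count_commits(trace, protocol):
    """Count the commits a process would execute for `trace` under
    CAND, CPVS, or CBNDVS."""
    commits = 0
    nd_since_commit = False   # used only by CBNDVS
    for kind in trace:
        if protocol == "CAND":
            if kind == "nd":
                commits += 1              # commit right after each ND event
        elif protocol == "CPVS":
            if kind in ("visible", "send"):
                commits += 1              # commit before each visible or send
        elif protocol == "CBNDVS":
            if kind == "nd":
                nd_since_commit = True
            elif kind in ("visible", "send") and nd_since_commit:
                commits += 1              # commit only if ND is outstanding
                nd_since_commit = False
        else:
            raise ValueError("unknown protocol: " + protocol)
    return commits
```

For the trace `["nd", "other", "visible", "visible", "send", "nd", "nd", "visible"]`, CAND commits three times, CPVS four times, and CBNDVS twice.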

Since commit events can involve writing lots of data to 
stable storage, they can be slow. Therefore, minimizing the 
number of commits executed can be important to failure-free 
performance. There exist several general techniques for min- 
imizing commits. 

Logging is a general technique for reducing an applica- 
tion’s non-determinism [12]. If an application writes the 
result of a non-deterministic event to a persistent log, and 
then uses that log record during recovery to ensure the event 
executes with the same result, the event is effectively ren- 
dered deterministic. Logging some of an application’s non- 
determinism can significantly reduce commit frequency. 
Logging all an application’s non-determinism lets the appli- 
cation uphold Save-work without committing at all. 
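
As a rough sketch of how logging renders an event deterministic, the class below records the result of each non-deterministic call and replays recorded results in order during recovery. The class name `ReplayLog` is hypothetical, and the in-memory list merely stands in for a persistent log; a real system would write records to stable storage.

```python
import random


class ReplayLog:
    """Record results of non-deterministic calls; replay them on recovery."""

    def __init__(self, records=None):
        self.records = list(records or [])  # stand-in for a persistent log
        self.position = 0                   # next record to replay

    def run(self, nd_call):
        if self.position < len(self.records):
            result = self.records[self.position]  # recovery: reuse old result
        else:
            result = nd_call()                    # normal run: execute and log
            self.records.append(result)
        self.position += 1
        return result


# Normal execution logs the result of a non-deterministic event.
log = ReplayLog()
first = log.run(random.random)

# Recovery starts from the surviving records and sees the same result.
recovered = ReplayLog(log.records)
replayed = recovered.run(random.random)
```

Here `replayed` equals `first`, so the event behaves deterministically across the failure.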

Tracking whether one process’s non-determinism caus- 
ally precedes events on another process can be complex. In 
fact, we can think of the CPVS protocol as pessimistically 
committing before send events rather than track causality 
between processes. However, applications can avoid com- 
mitting before sends without tracking causality by employ- 
ing a distributed commit, such as two-phase commit (2PC)— 
all processes would commit whenever any process does a 
visible event. Using two-phase commit can reduce commit 
frequency if visible events are less frequent than sends. 
Applications can further reduce commits by tracking causal- 
ity between processes, involving in the coordinated commit 
only those processes with relevant non-deterministic events. 
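
The trade-off between committing before sends and using a coordinated commit can be illustrated by simple counting. The sketch assumes per-process traces of hypothetical event labels and counts one commit per participating process for each coordinated commit; it ignores the message cost of the agreement protocol itself.

```python
def cpvs_commits(traces):
    """Total commits when each process commits before every visible or send."""
    return sum(kind in ("visible", "send") for trace in traces for kind in trace)


def coordinated_commits(traces):
    """Total commits when all processes commit at every visible event."""
    visible = sum(kind == "visible" for trace in traces for kind in trace)
    return visible * len(traces)


traces = [
    ["send", "send", "send", "visible"],   # process A
    ["send", "send", "other"],             # process B
]
```

With these traces, CPVS performs six commits while the coordinated scheme performs two, because sends greatly outnumber visible events.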

Not only can each of these protocols be viewed as a dif- 
ferent technique for upholding Save-work, but so can all 
existing protocols from the recovery literature. 

For example, pure message logging protocols make all 
message receive events deterministic, allowing applications 
whose only non-deterministic events are receives to uphold
Save-work without committing. The different message log- 
ging protocols differ in how the logging is carried out. For 
example, Sender-based Logging (SBL) protocols keep the 
log record for the receive event in the volatile memory of the 
sender [15], while Family-based Logging (FBL) keeps log 
entries in the memory of downstream processes [2]. 

In the Manetho system, each process maintains log 
records for all the non-deterministic events it depends on in 
an antecedence graph. When a process wants to execute a 
visible event, it upholds Save-work by writing the anteced- 
ence graph to stable storage [11]. In the Optimistic Logging 
protocol, processes write log records to stable storage asyn-
chronously [28]. When a process wants to do a visible event,
it upholds Save-work by first waiting for all relevant log 
records to make it to disk. 

The Targon/32 system attempts to handle more non- 
determinism than these other logging protocols [4]. All 
sources of non-determinism except signals are converted into 
messages that are logged in the memory of a backup process 
on another processor. Whenever a signal is delivered (an 
event that remains non-deterministic), Targon/32 forces a 
commit to uphold Save-work. The Hypervisor system logs 
all sources of non-determinism using a virtual machine 
under the operating system [5]. 

Under a Coordinated Checkpointing protocol, a pro- 
cess executing a visible event essentially assumes that all 
processes in the computation with which it has recently com- 
municated have executed non-deterministic events that caus- 
ally precede the visible event [18]. To uphold the Save-work 
invariant, the process executing the visible event initiates an 
agreement protocol to force all these other processes to com- 
mit. 

Each of these recovery protocols represents a different 
technique for upholding Save-work. Each to varying degrees 
trades off programmer effort and system complexity for 
reduced commit frequency (and hopefully overhead). 

Some protocols focus their effort to reduce commit fre- 
quency on the challenge of identifying and reducing non- 
determinism. Others endeavor to use knowledge of an appli- 
cation’s visible events. Still others do some of each. Each 
protocol can be seen as representing a point in a two-dimen- 
sional space of protocols. One axis in the space represents 
effort made to identify and possibly convert application non- 
determinism. The other axis represents effort made to iden- 
tify visible events and to commit as few non-visible events as 
possible. 

Such a protocol space is useful because it helps us 
understand the relationships between historically disparate 
protocols and to identify new ones. Figure 3 shows how the 
protocols we have described in this section might appear in 
such a protocol space. 

Figure 3: Protocol space. All consistent recovery protocols fall
somewhere in this space. Some protocols focus on dealing with
non-determinism, while others concern themselves with visible
events. Some do a little of each.


A protocol falling at the origin of the space would
uphold Save-work by committing every event executed by
each process, exerting no effort to determine which events
are non-deterministic or visible. As protocols fall further out 
the horizontal axis, they make sufficient effort to recognize 
that some events are deterministic and therefore do not 
require commits. At the point occupied by CAND, the proto- 
col makes sufficient effort to distinguish all of the applica- 
tion’s deterministic and non-deterministic events, executing 
a commit only after non-deterministic ones. Beyond that 
point, the protocols begin to employ logging, exerting effort 
to convert more and more of the application’s non-determin- 
istic events into deterministic ones. A protocol in that portion 
of the space forces a commit only when the application exe- 
cutes some unlogged non-determinism. At the point occu- 
pied by Hypervisor, the protocol makes sufficient effort to 
log all non-determinism, never forcing a commit. 

For the vertical axis, we can think of the protocol at the 
origin as committing all events rather than exert the effort 
needed to determine which events are visible. Protocols fall- 
ing further up the axis exert more effort to avoid committing 
events that are not visible. At the point occupied by CPVS, 
protocols commit only the true visible events and send 
events—committing before sends takes less effort than track- 
ing whether that send leads to a visible event on another pro- 
cess. Protocols falling yet further up in the space (such as 
Coordinated Checkpointing) are able to ask remote pro- 
cesses to commit if needed. Under those protocols, applica- 
tions are forced to commit before visible events only. 

Some protocols fall in the middle of the space, apply- 
ing techniques both for identifying and converting non-deter- 
minism, as well as for tracking the causal relationship 
between non-deterministic events and the visible events they
cause.


Figure 4: Protocol space with other design variables.

Although all protocols in the space are equivalent in 
terms of upholding Save-work, they do differ in terms of 
other design variables. As shown in Figure 4, we can map 
trends in several important design variables onto the protocol 
space. 

The farther a protocol falls from the origin, the lower 
its commit frequency is likely to be, and therefore, the better 
its performance. However, this improved performance comes 
at the expense of simplicity and reliability. Protocols close to 
the origin are very simple to implement, and therefore are 
more likely to be implemented correctly. 


For protocols that fall on the vertical axis, the recovery 
system need only roll back failed processes and let them
continue normally. Protocols further to the right in the proto- 
col space have longer recovery times because after rollback, 
the recovery system must for some time constrain reexecu- 
tion to follow the path taken before the failure. 


The further a protocol falls from the horizontal axis, the 
more non-determinism it safely leaves in the application. As 
we will discuss in Section 2.6, the more non-determinism in 
an application, the better the chance it will survive propaga- 
tion failures. 


2.5. Failure transparency for stop and propagation 
failures 


As mentioned in Section 2.2, failures can take two 
forms: stop failures and propagation failures. Upholding the 
Save-work invariant is enough to guarantee consistent recov- 
ery only in the presence of stop failures. To illustrate this 
observation, consider a protocol that commits all events a 
process executes. This protocol clearly upholds Save-work. 
However, if the process experiences a propagation failure 
(which by definition involves executing buggy events), this
protocol is guaranteed to commit buggy state. As a result, the
process will fail again during recovery, and the application
will never be able to complete after the failure.


Figure 5: Sample propagation failure timeline. A non-deterministic
event e causes buffer initialization to overflow and trash a pointer. A
commit any time after e will prevent recovery from this failure.

Thus, in order to guarantee consistent recovery in the 
presence of propagation failures, an application must not 
only commit to uphold Save-work, but when it commits it 
must avoid preserving the conditions of its failure. In this 
section we examine what exactly an application must do to 
guarantee consistent recovery in the presence of propagation 
failures. 

As was the case in our discussion of consistent recov- 
ery, non-deterministic events are central to the issue of 
recovering from propagation failures. Imagine an application 
that, as a result of non-deterministic event e, overruns a 
buffer it is clearing and zeroes out a pointer down the stack 
(see Figure 5). Later, it attempts to dereference the pointer 
and crashes. Obviously if the application commits after zero- 
ing the pointer, recovery is doomed. However, if the applica- 
tion commits any time before zeroing the pointer and after e, 
recovery will still be doomed if there are no other non-deter- 
ministic events after e. In this case, the pointer is not cor- 
rupted in the last committed state, but it is guaranteed to be 
re-corrupted during recovery. 

Note that had the application committed just before e 
and not after, all could be well. During recovery, the applica- 
tion would redo the non-deterministic event which could 
execute with a different result and avoid this failure alto- 
gether. 

Thus non-determinism helps our prospects for recover- 
ing from propagation failures by limiting the scope of what 
is preserved by a commit. 

But, not all non-determinism is created equal in this 
regard. In building up the Save-work invariant, we conserva- 
tively treated as non-deterministic any event that could con- 
ceivably have a different result during recovery. However, 
some non-deterministic events are likely to have the same 
result before and after a failure, and the recovery system can- 
not depend on these events to change after recovery. We will 
call these events fixed non-deterministic events.

A common example of a fixed non-deterministic event 
is user input. We cannot depend on the user to aid recovery 
by entering different input values after a failure. Other exam- 
ples of fixed non-deterministic events include non-determin- 
istic events whose results are based on the fullness of the 
disk (such as the write system call), or that depend on the
number of slots left in the operating system’s open file table
(such as the open system call).


Figure 6: Three sample machines with crash events (events that
end in states filled black). It is okay to commit in case B at the point
marked. Committing in either A or C where marked could prevent
recovery.


Non-deterministic events that are not fixed we will call 
transient non-deterministic events. Scheduler decisions, sig- 
nals, message ordering, the timing of user input, and system 
calls like gettimeofday are all transient non-determinis-
tic events.

We need to incorporate into our computational model a 
way to represent the eventual crash of a process during a 
propagation failure. We will model a process’s crash as the 
execution of a crash event. When a process executes a crash 
event, it transitions into a state from which it cannot continue 
execution. In the example in Figure 5, the crash event is the 
dereferencing of the null pointer. 

As mentioned above, an untimely commit during a 
propagation failure can ensure that recovery fails. Let us 
examine in more detail when a process should not commit. 


Clearly a process should not commit while executing a 
string of deterministic events that end in a crash event. Doing 
so is guaranteed to either commit the buggy state that leads 
to the crash, or to ensure that the faulty state is regenerated 
during recovery. This case is shown in Figure 6A. 


However, a process can safely commit before a tran- 
sient non-deterministic event as long as at least one of the 
possible results of that event does not lead to the execution of 
a crash event (see Figure 6B). 


How about committing before a fixed non-determinis- 
tic event where one of the event’s possible results leads to a 
crash? This case is shown in Figure 6C. If the application 
commits before the fixed non-deterministic event, recovery 
is possible only if the event executes with a result that leads 
down the path not including the crash event. If the applica- 
tion is unlucky and the fixed non-deterministic event sends 
the application down the path towards the crash, the commit 
will ensure recovery always fails. Since we cannot rely on 
fixed non-deterministic events having results conducive to
recovery, we cannot commit before any fixed non-determin-
istic events that might lead to a crash.


Figure 7: Portion of a state machine, its crash events, and
corresponding dangerous paths. Crash events are those that end in
states filled black. Fixed non-deterministic events are marked with
a slash. The shaded paths are dangerous.

We can infer that some paths through a portion of a 
state machine are problematic for handling propagation fail- 
ures—committing anywhere along the paths could prevent 
recovery. We next present an algorithm for finding these 
paths. For this discussion, we assume perfect knowledge of 
each process’s crash events. We recognize that this is not 
practical—if we knew all the crash events, we could likely 
fix all the bugs! However, making this assumption will help 
us to analyze when recovery is possible with the best possi- 
ble knowledge. 

Given a single process’s state machine and its crash 
events: 


Single-Process Dangerous 
Paths Algorithm 


* Color all crash events in the state machine. 


* Color an event e if all events out of e’s end state are 
colored. 


* Color an event e if at least one event out of e’s end 
state is colored and is a fixed non-deterministic event. 


We call all the paths in the state machine colored by 
this algorithm dangerous paths. A portion of a state machine 
with its dangerous paths highlighted is shown in Figure 7. 
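
The coloring rules can be run to a fixed point over an explicit state machine. The sketch below is one possible rendering, not the authors' implementation: the `Event` tuple, the kind labels ("det", "transient", "fixed"), and the helper `branch_machine` are hypothetical. It also assumes that the "all events out of e's end state are colored" rule applies only when the end state has at least one outgoing event, so that normal termination is not treated as dangerous.

```python
from collections import namedtuple

# An event is an edge of the state machine from state `src` to state `dst`.
Event = namedtuple("Event", "name src dst kind crash", defaults=[False])


def dangerous_events(events):
    """Return the names of all events colored by the three rules."""
    outgoing = {}
    for e in events:
        outgoing.setdefault(e.src, []).append(e)
    colored = {e.name for e in events if e.crash}       # rule 1
    changed = True
    while changed:                                      # iterate to a fixed point
        changed = False
        for e in events:
            if e.name in colored:
                continue
            nxt = outgoing.get(e.dst, [])
            all_colored = bool(nxt) and all(n.name in colored for n in nxt)  # rule 2
            fixed_colored = any(n.name in colored and n.kind == "fixed"
                                for n in nxt)                                # rule 3
            if all_colored or fixed_colored:
                colored.add(e.name)
                changed = True
    return colored


def branch_machine(kind):
    """A branch of the given kind where one outcome leads to a crash."""
    return [
        Event("start", "s0", "s1", "det"),
        Event("bad", "s1", "s2", kind),
        Event("good", "s1", "s3", kind),
        Event("crash", "s2", "dead", "det", crash=True),
    ]
```

With a transient branch, only "bad" and "crash" are colored, so a commit before the branch is safe. With a fixed branch, "start" is colored as well, so a commit before the branch lies on a dangerous path.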

We now present without proof a theorem which gov- 
erns when recovery is possible in the presence of propaga- 
tion failures. 


Lose-work Theorem 


Application-generic recovery from propagation failures 
is guaranteed to be possible if and only if the applica- 
tion executes no commit event on a dangerous path. 
This theorem provides an invariant for ensuring the
possibility of recovery from propagation failures: processes 
must not commit on dangerous paths. It is interesting to note 
that the location of the initial bug that caused the crash is sur- 
prisingly irrelevant. In the end, all that matters is the eventual 
crash event (or events) that result from that bug and its loca- 
tion relative to the application’s transient non-deterministic 
events. 

How about for multi-process applications? The chal- 
lenge for distributed applications is in computing their dan- 
gerous paths. Unlike the dangerous paths algorithm 
presented above, computing dangerous paths for a distrib- 
uted application cannot be done statically: whether one pro- 
cess’s path is dangerous can depend on the paths taken by the 
other processes in the computation and where they have 
committed. 

Given a process P that wants to determine its dangerous 
paths (presumably so it can commit without violating Lose- 
work): 


Multi-Process Dangerous 
Paths Algorithm 


* Process P collects a snapshot of when each process in
the computation last committed.


* For each non-deterministic receive event that P has
executed, treat that receive as a transient non-deter-
ministic event if the sender’s last commit occurred
before the send, and the sender executed a transient
non-deterministic event between its last commit and
the send. All other receives P has executed are fixed
non-deterministic events.


* Run the single-process dangerous paths algorithm to
compute P’s dangerous paths.
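
The second step, classifying each receive, can be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes that events on the sender are identified by their position in the sender's local execution order.

```python
def classify_receive(sender_last_commit, send_index, sender_transient_indices):
    """Classify a receive executed by P as "transient" or "fixed".

    sender_last_commit: position of the sender's most recent commit
    send_index: position of the send that produced this message
    sender_transient_indices: positions of the sender's transient
        non-deterministic events
    """
    committed_before_send = sender_last_commit < send_index
    transient_in_between = any(
        sender_last_commit < t < send_index for t in sender_transient_indices
    )
    if committed_before_send and transient_in_between:
        return "transient"   # the sender may redo this send differently
    return "fixed"           # the sender is bound to resend the same message
```

A receive whose send followed an uncommitted transient event on the sender is treated as transient; every other receive is treated as fixed.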


2.6. Upholding Lose-work 


The simplest way to uphold Lose-work is to ensure that 
no process ever commits. Although this solution has the 
advantage of requiring no application-specific knowledge to 
implement, it also prohibits guaranteeing consistent recov- 
ery. 

Clearly, without perfect knowledge of the application’s 
non-determinism and crash events it is impossible to guaran- 
tee a committing application upholds Lose-work. Despite the 
impossibility of directly upholding the invariant, we can use 
the Lose-work Theorem to draw some conclusions about 
recovering from propagation failures. 

First, we observe that it is impossible to uphold both 
Save-work and Lose-work for some applications. Consider 
an application with a visible event on a dangerous path. The 
dangerous path will extend back at least to the last non-deter- 
ministic event. Upholding Save-work forces the application 
to commit between the last non-deterministic event and the 
visible event, which will violate Lose-work. 

Second, some protocols designed to uphold Save-work 
for stop failures guarantee that applications will not recover 
from propagation failures. These protocols either commit or
convert all non-determinism, ensuring a commit after the 
non-deterministic event that steers a process onto a danger- 
ous path, thus violating Lose-work. CAND, Sender-based 
logging, Targon/32, and Hypervisor are all examples of pro- 
tocols that prevent applications from surviving propagation 
failures. Indeed, any protocol that falls on the horizontal axis 
of the Save-work protocol space (see Figure 3) will prevent 
upholding Lose-work. The farther a protocol falls from the 
horizontal axis, the more it focuses its attention on handling 
visible events and the more non-determinism it leaves safely 
uncommitted, thus decreasing the chances of violating Lose- 
work (see Figure 4). 


Although directly upholding Lose-work is impossible, 
some applications with mostly “non-repeatable” bugs (so 
called “Heisenbugs” [13]) may be able to commit with a low 
probability of violating the invariant. There are also a num- 
ber of ways applications can deliberately endeavor to mini- 
mize the chance that one of their commits causes them to 
violate Lose-work. 


First, applications should try to crash as soon as possi- 
ble after their bugs get triggered. Doing so shortens danger- 
ous paths and thus lowers the probability of the application 
committing while executing on one. In order to move crashes 
sooner, processes can try to catch erroneous state by per- 
forming consistency checks. For example, a process could 
traverse its data structures looking for corruption, it could 
compute a checksum over some data, or it could inspect 
guard bands at the ends of its buffers and malloc’ed data. 
Voting amongst independent replicas is a general but expen- 
sive way to detect erroneous execution [27]. When a process 
fails one of these checks, it simply terminates execution, 
effectively crashing. 


Although it is a good idea for processes to perform 
these consistency checks frequently, performing them right 
before committing is particularly important. 
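
A consistency check performed immediately before a commit might look like the following sketch, which uses a checksum over the data to be committed. The class `GuardedState`, the exception `CorruptionDetected`, and the list used as stable storage are hypothetical stand-ins for illustration; `zlib.crc32` is the standard-library checksum function.

```python
import zlib


class CorruptionDetected(RuntimeError):
    """Raised instead of committing state that fails its check."""


class GuardedState:
    def __init__(self, payload):
        self.payload = bytearray(payload)
        self.checksum = zlib.crc32(self.payload)

    def update(self, payload):
        # Legitimate updates refresh the checksum along with the data.
        self.payload = bytearray(payload)
        self.checksum = zlib.crc32(self.payload)

    def commit(self, stable_storage):
        # Check just before committing; refuse to preserve corrupted state.
        if zlib.crc32(self.payload) != self.checksum:
            raise CorruptionDetected("checksum mismatch; not committing")
        stable_storage.append(bytes(self.payload))
```

If a stray write modifies `payload` without going through `update`, the next `commit` raises instead of saving the corrupted data, which corresponds to the process crashing before it can commit on a dangerous path.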


Applications may also be able to reduce the likelihood 
they will violate Lose-work by not committing all their state. 
Applications may have knowledge of which data absolutely 
must be preserved, and which data can be recomputed from 
an earlier (hopefully bug-free) state. Should a bug corrupt 
state that is not written to stable storage during commit, 
recomputing that state after a failure leaves open the possi- 
bility of not retriggering the bug. 

Applications can also try to commit as infrequently as 
possible. When upholding Save-work, applications should 
do so with a protocol that commits less often and that leaves 
as much non-determinism as possible. Some applications 
may be able to add non-determinism to their execution, or 
they may be able to choose a non-deterministic algorithm 
over a deterministic one. 

The application or the operating system may also be able
to make some fixed non-deterministic events into transient
ones by increasing disk space or other application resource
limits after a failure.

In Section 4 we will measure how often several appli- 
cations violate Lose-work in the process of upholding Save- 
work. 


3. Cost of Upholding Save-work 

In Section 2.3, we presented the Save-work invariant, 
which applications can uphold to guarantee consistent recov- 
ery in the presence of stop failures. However, we have not 
talked about the performance penalty applications incur to 
uphold it. As mentioned above, executing commits can be
expensive, and for real applications adhering to Save-work
may be prohibitively expensive. In this section we measure
the performance penalty incurred by several real applications
upholding Save-work.

For this experiment we have selected four real applica- 
tions: nvi, magic, xpilot and TreadMarks. nvi is a public 
domain version of the well known Unix text editor vi. magic
is a VLSI CAD tool. xpilot is a distributed, multi-user game. 
Finally, TreadMarks is a distributed shared memory system. 
Within TreadMarks’s shared memory environment we run an 
N-body simulation called Barnes-Hut. 

Of these applications, all but TreadMarks are interac- 
tive. We chose mainly interactive applications for several 
reasons. First, interactive applications are important recipi- 
ents of failure transparency (when these applications fail 
there is always an annoyed user nearby). Second, interactive 
applications have been little studied in recovery literature. 
Finally, interactive applications can be hard to recover: they
have copious system state, non-determinism, and visible out-
put, all of which require a capable recovery system.

TreadMarks and xpilot are both distributed applica- 
tions, while the others are single-process. 

To recover these applications we run them on top of
Discount Checking, a system designed to provide failure
transparency efficiently using lightweight, full-process
checkpoints [24]. Discount Checking is built on top of reli-
able memory provided by the Rio File Cache [9], and light-
weight transactions provided by the Vista transaction library
[23].

In order to preserve the full user-level state of a pro- 
cess, Discount Checking maps the process’s entire address 
space into a segment of reliable memory managed by Vista. 
Vista traps updates to the process’s address space using 
copy-on-write, and logs the before-images of updated 
regions to its persistent undo log. To capture the application 
state in the register file (which cannot be mapped into persis- 
tent memory), Discount Checking copies the register file into 
a persistent buffer at commit time. Thus, taking a checkpoint 
amounts to copying the register file, atomically discarding 
the undo log, and resetting page protections. 
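
The checkpointing scheme just described can be illustrated with a toy
model. The sketch below is not Discount Checking's or Vista's code: the
class name, the dictionary standing in for the address space, and the
explicit write() call standing in for a copy-on-write trap are all
assumptions made for illustration.

```python
_ABSENT = object()  # marks an address that did not exist before the write


class UndoLogCheckpointer:
    """Toy model of checkpointing with an undo log of before-images."""

    def __init__(self, memory, registers):
        self.memory = dict(memory)               # stands in for the mapped address space
        self.registers = dict(registers)         # volatile register file
        self.saved_registers = dict(registers)   # persistent copy taken at commit time
        self.undo_log = {}                       # address -> before-image since last commit

    def write(self, addr, value):
        # On the first update to an address since the last commit, log its
        # before-image (the real system traps this with page protections).
        if addr not in self.undo_log:
            self.undo_log[addr] = self.memory.get(addr, _ABSENT)
        self.memory[addr] = value

    def commit(self):
        # Taking a checkpoint: copy the register file, discard the undo log.
        self.saved_registers = dict(self.registers)
        self.undo_log.clear()

    def recover(self):
        # Roll back to the last checkpoint by restoring before-images.
        for addr, before in self.undo_log.items():
            if before is _ABSENT:
                self.memory.pop(addr, None)
            else:
                self.memory[addr] = before
        self.undo_log.clear()
        self.registers = dict(self.saved_registers)
```

In this model a commit is cheap because it only copies the registers and
empties the log; the cost of saving state is paid incrementally as
regions are first updated.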

Although the steps outlined so far will allow Discount 
Checking to checkpoint and recover user-level state, Discount
Checking must also preserve and recover the application’s
kernel state. To capture system state, the library
implements a form of copy-on-write for kernel data: it traps 
system calls, copies their parameter values into persistent 
buffers, and then uses those parameter values to directly 
reconstruct relevant kernel state during recovery. For more 
on the inner workings of Discount Checking, please see [24]. 

As mentioned in Section 2.4, there exist a large variety 
of protocols for upholding Save-work. In order to get a sense 
of which work best for our suite of applications, we imple- 
mented seven different protocols within Discount Checking. 
Our core protocols are CAND, CPVS, and CBNDVS, which
we described in Section 2.4. Recall that CAND upholds 
Save-work by committing immediately after every non- 
deterministic event. CPVS commits just before all visible 
and send events. CBNDVS commits before a visible or send 
event if the process has executed a non-deterministic event 
since its last commit. We also added to Discount Checking 
the ability to log non-deterministic user input and message 
receive events to render them deterministic, as well as the 
ability to use two-phase commit so one process can safely 
pass a dependency on an uncommitted non-deterministic 
event to another process. Adding these techniques to our 
core protocols yielded an additional four protocols: CAND- 
LOG, CBNDVS-LOG, CPV-2PC, and CBNDV-2PC. For 
example, CAND-LOG executes a commit immediately after 
any non-deterministic event that has not been logged. CPV-2PC
commits all processes whenever any process executes a visible
event, but does not need to commit before a process does a
send.
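
To make the commit rules of the three core protocols concrete, the
following sketch counts the commits a single process would take on a
given event trace. It is a simplified illustration rather than the
Discount Checking implementation; the event labels ("nd", "visible",
"send", "other") and the function name are hypothetical.

```python
def count_commits(events, protocol):
    """Count commits taken by one process under a core protocol.

    events:   iterable of "nd" (non-deterministic), "visible", "send", "other"
    protocol: "CAND", "CPVS", or "CBNDVS"
    """
    if protocol not in ("CAND", "CPVS", "CBNDVS"):
        raise ValueError(f"unknown protocol: {protocol}")

    commits = 0
    nd_since_commit = False  # only consulted by CBNDVS
    for ev in events:
        if protocol == "CAND":
            # Commit immediately after every non-deterministic event.
            if ev == "nd":
                commits += 1
        elif protocol == "CPVS":
            # Commit just before every visible or send event.
            if ev in ("visible", "send"):
                commits += 1
        else:  # CBNDVS
            # Commit before a visible or send event only if a
            # non-deterministic event occurred since the last commit.
            if ev in ("visible", "send") and nd_since_commit:
                commits += 1
                nd_since_commit = False
            if ev == "nd":
                nd_since_commit = True
    return commits


trace = ["nd", "nd", "visible", "visible", "nd", "send", "other", "visible"]
for name in ("CAND", "CPVS", "CBNDVS"):
    print(name, count_commits(trace, name))
```

On a trace like this one, CBNDVS never commits more often than CPVS,
since it skips the commit whenever nothing non-deterministic has
happened since the previous one.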

In order to implement these protocols, Discount Check- 
ing needs to get notification of an application’s non-deter- 
ministic, visible, and send events. To learn of an 
application’s non-deterministic events, Discount Checking 
intercepts a process’s signals and non-deterministic system 
calls such as gettimeofday, bind, select, read, 
recvmsg, recv, and recvfrom. To learn of a process’s 
visible and send events, Discount Checking intercepts calls 
to write, send, sendto, and sendmsg. 
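
The interception described above amounts to classifying each trapped
call by name. A minimal sketch of such a classifier follows, using only
the calls listed in the text; the function and constant names are
hypothetical, and signals (which are also non-deterministic events) are
omitted because they are not system calls.

```python
# Calls the text lists as sources of non-determinism.
NON_DETERMINISTIC_CALLS = frozenset(
    {"gettimeofday", "bind", "select", "read", "recvmsg", "recv", "recvfrom"}
)

# Calls the text lists as producing visible or send events.
VISIBLE_OR_SEND_CALLS = frozenset({"write", "send", "sendto", "sendmsg"})


def classify_call(name):
    """Map an intercepted system call name to its event class."""
    if name in NON_DETERMINISTIC_CALLS:
        return "non-deterministic"
    if name in VISIBLE_OR_SEND_CALLS:
        return "visible-or-send"
    return "other"
```

The sketch deliberately does not separate visible events from send
events, since the text does not say how each of the four output calls is
assigned.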

In addition to measuring the performance of our appli- 
cations on Discount Checking on Rio, we wanted to get a 
sense of how our applications performed using a disk-based 
recovery system. We created a modified version of Discount 
Checking called DC-disk that wrote out a redo log synchro- 
nously to disk at checkpoint time. Although we did not add 
the code needed to let DC-disk truncate its redo log, or even 
properly recover applications, its overhead should be repre- 
sentative of what a lightweight disk-based recovery system 
can do. 

We ran our experiments on 400 MHz Pentium II com- 
puters each with 128 MB of memory (100 MHz SDRAM). 
Each machine runs FreeBSD 2.2.7 with Rio and is connected 
to a 100 Mb/s switched Ethernet. Rio was turned off when
using DC-disk. Each computer has a single IBM Ultrastar
DCAS-34330W ultra-wide SCSI disk. All points represent
the average of five runs. The standard deviation for each data
point was less than 1% of the mean for Discount Checking,
and less than 4% of the mean for DC-disk. The distributed
workloads (TreadMarks and xpilot) were both run on four
computers. We simulate fast interactive rates by delaying
100 ms between each keystroke in nvi and by delaying 1 second
between each mouse-generated command in magic.

Figure 8: Performance of several protocols for four applications.
Each application has its own protocol space. At each point in each
space, we list the protocol at that point, the number of checkpoints
in the complete run of the application, and the runtime overhead for
Discount Checking, and for DC-disk. For xpilot we list the
protocol, number of checkpoints per second, followed by
sustainable frame rate for Discount Checking and DC-disk. Full
speed for xpilot is 15 frames per second.

We present the result of our runs in Figure 8. For each 
application we show the protocol space developed in Section 
2.4. In each application’s protocol space we plot the protocol 
used for each data point, and the number of checkpoints 
taken during the complete run of the application when run- 
ning on that protocol. For each protocol’s data point we also 
show the percent expansion in execution time that protocol 
caused compared to an unrecoverable version of the applica- 
tion, first for Discount Checking, then for DC-disk. 

Because xpilot is a real-time, continuous program we 
report its performance as the frame rate it can sustain rather 
than runtime overhead. Higher frame rates indicate better 
interactivity, with full speed being 15 frames-per-second. 
xpilot’s number of checkpoints is given as the largest
checkpointing frequency (in checkpoints per second) amongst its
processes. 

We can make a number of interesting observations 
based on these results. As expected, commit frequency gen- 
erally decreases, and performance increases, with radial dis- 
tance from the origin. The sole exception to this rule is 
xpilot, where having all processes commit whenever any one 
of them wants to execute a visible event (as is done in protocols
using two-phase commit) results in a net increase in
commit frequency. 


Despite the fact that several of these applications gener- 
ate many commits, there is at least one protocol for each 
application with very low overhead for Discount Checking. 
We conclude that the cost of upholding Save-work using 
Discount Checking on these applications is low. 


For all the interactive applications, the overhead of 
using DC-disk is not prohibitive. We see overhead of 12% 
and 27% for nvi and magic respectively. xpilot is able to sus- 
tain a usable 9 frames per second. On the other hand, no pro- 
tocol for DC-disk was able to keep up with TreadMarks. 
From our experiments, we conclude that Save-work can be 
upheld with a disk-based recovery system for many interac- 
tive applications with reasonably low overhead. 


We observe that the protocols that perform best for 
each application are the ones that exploit the infrequent class 
of events for that application in deciding when to commit. 
For example, TreadMarks has very few visible events, 
despite having copious non-deterministic and send events. 
For it, the 2PC protocols which let it commit only for the 
rare visible events are the big win. 


While overhead is low for many applications, we can 
conceive of applications for which Save-work incurs a large 
performance overhead. These applications would have copi- 
ous visible and non-deterministic events—that is, no rare 
class of events—and they would be compute bound rather 
than user bound. Applications that might fall into this cate- 
gory include interactive scientific or engineering simulation, 
online transaction processing, and medical visualization. 


4. Measuring Conflict between the Save-work 
and Lose-work Invariants 


Guaranteeing consistent recovery in the presence of 
stop and propagation failures requires upholding both the 
Save-work and Lose-work invariants. Unfortunately, some 
failure scenarios make it impossible to uphold both invari- 
ants simultaneously. 


For example, consider the failure timeline shown in 
Figure 9. In this timeline, the application executes a transient 
non-deterministic event that causes it to execute down a code 
path containing a bug. The application eventually executes 
the buggy code (shown as “fault activation”), then correctly 
executes a visible event. After this visible event, the program 
crashes. Section 2.5’s coloring algorithm shows that the 
entire execution path from the transient non-deterministic 
event to the crash forms a dangerous path, along which the 
Lose-work invariant prohibits a commit. Unfortunately, the 
Save-work invariant specifically requires a commit between 
the transient non-deterministic event and the visible event. 
For this application, both invariants cannot be upheld simul- 
taneously. 

Figure 9: Failure timeline in which the Save-work invariant and the
Lose-work invariant conflict. The shaded portion is the dangerous
path.


Some applications may have bugs that prevent uphold- 
ing Lose-work even without committing to uphold Save- 
work. For example, many applications contain repeatable 
bugs (so-called “Bohrbugs” [13]). With these faults it is possible
to execute from the initial state of the program to the
bug without ever executing a transient non-deterministic 
event. In other words, the dangerous path resulting from the 
bug extends all the way back to the initial state of the pro- 
gram. And since the initial state of any application is always 
committed, applications with Bohrbugs inherently violate 
Lose-work.

In this section, we endeavor to examine how often in 
practice faults cause a fundamental conflict between the 
Save-work and Lose-work invariants. Our focus is on soft- 
ware faults (both in the application and operating system), 
which field studies and everyday experience teach are the
dominant cause of failures today [13].


4.1. Application faults 


We would like to measure how often upholding Save- 
work forces an application to commit on a dangerous path, 
like the application depicted in Figure 9. We divide this 
problem into three subproblems. First, how often does an 
application bug create a dangerous path beginning at the start 
state of the application? As described above, this scenario 
arises from Bohrbugs in the application. Second, given an 
application fault that does depend on a transient non-deter- 
ministic event (a Heisenbug), how often is the application 
forced to commit between the transient non-deterministic 
event at the beginning of the dangerous path and the fault 
activation? Third, how often is the application forced to 
commit between the fault activation and the crash? We 
examine this third question first using a fault-injection study. 


Our strategy is to force crashes of real applications, 
recover the applications, and measure after the fact whether 
any of their commits to uphold Save-work occurred between 
fault activation and the crash. We induce faults in the appli- 
cation by running a version of the application with changes 
in the source code to simulate a variety of programming 
errors. These errors include actions like overwriting random 
data in the stack or heap, changing the destination variable, 
neglecting to initialize a variable, deleting a branch, deleting 
a random line of source code, and off-by-one errors in condi- 
tions like >= and <. See [6] for more information on our fault 
model. We only consider runs where the program crashes. 

Table 1: Fraction of application faults in nvi and postgres that
violate Lose-work by committing after the fault is activated. For 
each fault type we list the percent of crashes by that fault that 
commit after the fault is activated. Over all fault types, nvi and 
postgres commit after the fault activation for 37% and 33% of all 
crashes respectively. 


Checkpointing and recovery for the applications is pro- 
vided by Discount Checking using the CPVS protocol. 
CPVS is the best protocol possible for not violating Lose- 
work for non-distributed applications. For our experiments, 
we use two applications: the Unix text editor nvi, and post- 
gres, a large, publicly available relational database. These 
two applications differ greatly in their code size and amount 
of data they touch while executing. 

We detect a run in which the application commits 
between fault activation and the crash by instrumenting Dis- 
count Checking to log each fault activation and commit 
event. If the program commits after activating the fault, it has 
violated the Lose-work invariant. We also conduct an end-to-end
check of this criterion by suppressing the fault activation
during recovery, recovering the process, and trying to com- 
plete the run. As expected, we found that runs recovered 
from crashes if and only if they did not commit after fault 
activation. 
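
The detection step can be sketched as a scan over the instrumented log.
The example below assumes a log represented as a list of strings in
execution order, with hypothetical entry names "fault" (fault
activation), "commit", and "crash"; the actual log format used by the
instrumented Discount Checking is not described here.

```python
def commits_after_fault(log):
    """Return True if any commit follows fault activation in the log.

    A True result means the run violated the Lose-work invariant.
    """
    fault_seen = False
    for entry in log:
        if entry == "fault":
            fault_seen = True
        elif entry == "commit" and fault_seen:
            return True
    return False


print(commits_after_fault(["commit", "fault", "commit", "crash"]))
print(commits_after_fault(["commit", "fault", "crash"]))
```

The first run commits between fault activation and the crash, so it is
flagged; the second crashes without committing again and is not.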

We collected data from approximately 50 crashes for 
each fault type. Table 1 shows the fraction of crashes that 
violated the Lose-work invariant by committing after fault 
activation. For both nvi and postgres, approximately 35% of 
faults caused the process to commit along this portion of the 
dangerous path. While not included in the table, 7-9% of the 
runs did not crash but resulted in incorrect program output. 

We next turn our attention to question one, namely, for 
what fraction of bugs does the dangerous path extend back to 
the initial state of the program? That is, of the bugs users 
encounter, what portion are deterministic (Bohrbugs), and 
what portion depend on a transient non-deterministic event 
(Heisenbugs)? Although it is difficult to measure this frac- 
tion directly, several prior studies have attempted to shed 
light on this issue. 

Chandra and Chen showed that for Apache, GNOME,
and MySQL, three large, publicly available software pack- 
ages, only 5-15% of the bugs in the developer’s bug log were 
Heisenbugs (for shipping versions of the applications) [7]. 
The remaining bugs were Bohrbugs. Most of these determin- 
istic bugs resulted from untested boundary conditions (e.g. 
an older version of Apache crashed when the URL was too 
long). Several other researchers have found a similarly low 
occurrence (5-29%) of application bugs that depend on tran- 
sient non-deterministic events like timing [20, 29, 30]. Note 
that these results conflict with the conventional wisdom that 
mature code is populated mainly by Heisenbugs—it has been 
held that the easier-to-find Bohrbugs will be captured more 
often during development [13, 17]. It appears that for non- 
mission-critical applications, the current software culture 
tolerates a surprising number of deterministic bugs. 

We have yet to tackle the second question, which asks 
how often an application is forced to commit on the danger- 
ous path between the transient non-deterministic event and 
fault activation (see Figure 9). Unfortunately, we are unable 
to measure this frequency using our fault-injection technique 
because no realistic model exists for placing injected bugs 
relative to an application’s transient non-deterministic
events. However, as we will see, the case for generic recov- 
ery from application failures is already sufficiently discour- 
aging, even optimistically assuming no commits on this 
portion of the dangerous path. 

We would like to compose these separate experimental 
results in order to illuminate the overarching question of this 
section. Our fault-injection study shows that nvi and postgres 
violate Lose-work in at least 35% of crashes from non-deter- 
ministic faults. If we assume the same distribution of deter- 
ministic and non-deterministic bugs in nvi and postgres as 
found in Apache, GNOME, and MySQL by Chandra and 
Chen, these non-deterministic faults make up only 5-15% of 
crashes. Therefore, Lose-work is upheld in at most 65% of 
15%, or 10% of application crashes. Lose-work and Save- 
work appear to conflict in the remaining 90% of failures by 
these applications. While extrapolating other applications’ 
fault distributions to nvi and postgres is somewhat question- 
able, as is generalizing to all applications from the measure- 
ment of two, these preliminary results raise serious questions 
about the feasibility of generic recovery from propagation 
failures. 
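
The composition above is simple arithmetic, reproduced here as a short
check. The two inputs are the figures quoted in the text (35% of
crashes from non-deterministic faults violate Lose-work; such faults
are at most 15% of crashes); the variable names are ours.

```python
# Fraction of crashes from non-deterministic faults that commit after
# fault activation (lower bound from the fault-injection study).
violate_given_nondeterministic = 0.35

# Upper end of the 5-15% range of bugs that are non-deterministic.
nondeterministic_fraction_max = 0.15

# Lose-work can only be upheld for non-deterministic faults that do not
# commit after fault activation.
upheld_max = (1 - violate_given_nondeterministic) * nondeterministic_fraction_max
conflict_min = 1 - upheld_max

print(f"Lose-work upheld in at most {upheld_max:.1%} of crashes")
print(f"Invariants conflict in at least {conflict_min:.1%} of crashes")
```

The product is 9.75%, which the text rounds to 10%, leaving roughly 90%
of crashes in which the two invariants conflict.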


4.2. Operating systems faults 


Although failures due to application faults appear to 
frequently violate Lose-work, we can hope for better news 
for faults in the operating system. In contrast to application 
faults, not all operating system faults cause propagation fail- 
ures: some crash the system before they affect application 
state. Commits at any time by the application are okay in the 
presence of these stop failures. Thus if failures by the operating
system usually manifest as stop failures, we would rarely
observe system failures causing Lose-work violations.

Table 2: Percent of OS faults in which nvi and postgres failed to
recover. We list the percentage of crashes that led to failures during
recovery for each fault type and over all failures.

We wanted to measure the fraction of operating system 
failures for which applications are able to successfully 
recover. This fraction will include cases where the operating 
system experienced a stop failure, as well as cases in which 
the system failure was a propagation failure and the applica- 
tion did not violate Lose-work. 

In order to perform this measurement, we again use a 
fault-injection study. This time we inject faults into the run- 
ning kernel rather than into the application [9]. 

We again ran nvi and postgres with Discount Checking 
upholding Save-work using CPVS. For each run, we started 
the application and injected a particular type of fault into the 
kernel. We discarded runs in which neither the system nor 
the application crashed. If either the operating system or the 
application crashed, we rebooted the system and attempted 
to recover the application. We repeated this process until 
each fault type had induced approximately 50 crashes. The 
results of this experiment are shown in Table 2. 

Of the 350 operating system crashes we induced for 
each application, we found that nvi failed to properly recover 
in 15% of crashes. postgres did better, only failing to recover 
3% of the time. These numbers are encouraging: application- 
generic recovery is likely to work for operating systems fail- 
ures, despite the challenge of upholding Lose-work. 

If we assume that all propagation failures will violate 
Lose-work with the probabilities in Table 1 (regardless of 
whether the propagation failure began in the operating sys- 
tem or application), we can infer how often system failures 
manifest as propagation failures in our experiments. Com- 
bining our application crash results with our operating sys- 
tem crash results implies that for nvi, 41% of system failures 
were propagation failures. For postgres, 10% of system fail- 
ures manifest as propagation failures. We hypothesize that 
the proportion of propagation failures differs for the two 
applications because of the different rate at which they
communicate with the operating system: the non-interactive
version of nvi used in our crash tests executes almost 10 times as
many system calls per second as postgres executes. 
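
The inference above can be written out explicitly. If an application
fails to recover only when a system failure is a propagation failure
and that failure violates Lose-work, then the propagation fraction is
the recovery-failure rate divided by the violation probability. The
sketch below applies this to the rounded figures quoted in the text
(15% and 3% failed recoveries; 37% and 33% violation rates from Table
1); the function name is hypothetical.

```python
def propagation_fraction(failed_recovery_rate, violation_rate):
    """Estimate the fraction of system failures that are propagation failures.

    Assumes: P(fail to recover) = P(propagation) * P(violate | propagation).
    """
    return failed_recovery_rate / violation_rate


nvi = propagation_fraction(0.15, 0.37)
postgres = propagation_fraction(0.03, 0.33)
print(f"nvi: {nvi:.0%}, postgres: {postgres:.0%}")
```

With these rounded inputs the estimate for nvi matches the 41% reported,
while postgres comes out near 9% rather than 10%; we assume the small
difference comes from the reported value being computed on unrounded
measurements.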


5. Related Work 


Many fault-tolerant systems are constructed using 
transactions to aid recovery. Transactions simplify recovery 
by grouping separate operations into atomic units, reducing 
the number of states from which an application must recover 
after a crash. However, the programmer must still bear the 
responsibility for building recoverability into his or her 
applications, a task that is difficult even with transactions. 
We have focused on higher-level application-generic tech- 
niques that absolve programmers from adding recovery abil- 
ities to their software. However, we use transactions to 
implement our abstraction. 


A number of researchers have endeavored to build sys- 
tems that provide some flavor of failure transparency for stop 
failures [3, 4, 5, 11, 14, 21, 25, 26]. Our work extends their 
work by analyzing propagation failures as well. 


The theory of distributed recovery has been studied at 
length [10]. Prior work has established that committed states 
in distributed systems must form a consistent cut to prevent 
orphan processes [8], that recoverable systems must preserve 
a consistent cut before visible events [28], and that non- 
determinism bounds the states preserved by commits [11, 
16]. Our Save-work invariant is equivalent to the confluence 
of these prior results. 


The Save-work invariant contributes to recovery theory 
by expressing the established rules for recovery in a single, 
elemental invariant. Viewing consistent recovery through the 
lens of Save-work, we exposed the protocol space and the
relationships between the disparate protocols on it, as well as
several new protocols. 


To the best of our knowledge, no prior work has pro- 
posed an invariant for surviving propagation failures that 
relates all relevant events in a process, nor has any prior 
work attempted to evaluate the fraction of propagation fail- 
ures for which consistent recovery is not possible. 


CAND, CPVS, and CBNDVS all bear a resemblance to 
simple communication-induced checkpointing protocols 
(CIC) [1]. However, there are some important differences.
First, all CIC protocols assume no knowledge of application 
non-determinism. As a result, they are forced to roll back 
any process that has received a message from an aborted 
sender. Commits under these protocols serve primarily to 
limit rollback distance, and to prevent the domino effect. In 
contrast, our protocols all to varying degrees make use of 
knowledge of application non-determinism. Rather than 
abort the receivers of lost messages, they allow senders to 
deterministically regenerate the messages. Under our proto- 
cols, only failed processes are forced to roll back. 

Recovery systems often depend on the assumption that
applications will not commit faulty state, a so-called
“fail-stop assumption” [27]. Our examination of propagation
failures amounts to a fine parsing of the traditional fail-stop
assumption in which we consider a single commit’s ability to 
preserve not just past execution, but all future execution up 
to the next non-deterministic event. Making a fail-stop 
assumption in the presence of propagation failures is the 
same as assuming that applications can safely commit at any 
time without violating the Lose-work invariant. 


6. Conclusion 

The lure of operating systems that conceal failures is 
quite powerful. After all, what user or programmer wants to 
be burdened with the complexities of dealing with failures? 
Ideally, we could handle all those complexities once and for 
all in the operating system. 

Our goal with this paper has been to explore the subject 
of failure transparency, looking at what it takes to provide it 
and exposing the circumstances where providing it is not 
possible. We find that providing failure transparency in gen- 
eral involves upholding two invariants, a Save-work invariant 
which constrains when an application must preserve its work 
before a failure, and a Lose-work invariant which constrains 
how much work the application has to throw away after a 
failure. 

For stop failures, which do not require upholding Lose- 
work, the picture is quite rosy. We show that Save-work can 
be efficiently upheld for a variety of real applications. Using 
a transparent recovery system based on reliable memory, we 
find overheads of only 0-12% for our suite of real applica- 
tions. We also find that disk-based recovery makes a credible 
go of it, with interactive applications experiencing only mod- 
erate overhead. 

Unfortunately, the picture is somewhat bleaker for sur- 
viving propagation failures. Guaranteeing that an application 
can recover from a propagation failure requires upholding 
our Lose-work invariant, and Save-work and Lose-work can 
directly conflict for some fault scenarios. In our measure- 
ments of application faults in nvi and postgres, upholding 
Save-work causes them to violate Lose-work for at least 35% 
of crashes. Even worse, studies have suggested that 85-95% 
of application bugs today cause crashes that violate the Lose- 
work invariant by extending the dangerous path to the initial 
state. 

We conclude that providing failure transparency for 
stop failures alone is feasible, but that recovering from prop- 
agation failures requires help from the application. Applica- 
tions can help by performing better error detection, masking 
errors through N-version programming, reducing commit 
frequency by allowing the loss of some visible events, or 
reducing the comprehensiveness of the state saved by the 
recovery system. Our results point to interesting future work. 
Since pure application-generic recovery is not always possible,
what is the proper balance between generic recovery services
provided by the operating system and application-
specific aids to recovery provided by the programmer? 


8. Software Availability


The Rio version of the FreeBSD 2.2.7 kernel, as well as
Vista and Discount Checking are all available for download 
at http://www.eecs.umich.edu/Rio. 


Abstract


The tradeoffs between consistency, performance, and 
availability are well understood. Traditionally, how- 
ever, designers of replicated systems have been forced 
to choose from either strong consistency guarantees or 
none at all. This paper explores the semantic space be- 
tween traditional strong and optimistic consistency mod- 
els for replicated services. We argue that an important 
class of applications can tolerate relaxed consistency, but 
benefit from bounding the maximum rate of inconsistent 
access in an application-specific manner. Thus, we de- 
velop a set of metrics, Numerical Error, Order Error, 
and Staleness, to capture the consistency spectrum. We 
then present the design and implementation of TACT, 
a middleware layer that enforces arbitrary consistency 
bounds among replicas using these metrics. Finally, we 
show that three replicated applications demonstrate sig- 
nificant semantic and performance benefits from using 
our framework. 


1 Introduction 


Replicating distributed services for increased avail- 
ability and performance has been a topic of considerable 
interest for many years. Recently, however, the exponential
increase in access to popular Web services provides
us with concrete examples of the types of services that 
would benefit from replication, their requirements and 
semantics. One of the primary challenges to replicating 
network services is consistency across replicas. Provid- 
ing strong consistency (e.g., one-copy serializability [4]) 
imposes performance overheads and limits system availability.
Thus, a variety of optimistic consistency models
[14, 15, 18, 31, 34] have been proposed for applications
that can tolerate relaxed consistency. Such models
require less communication, resulting in improved per- 
formance and availability. 

Unfortunately, optimistic models typically provide no 
bounds on the inconsistency of the data exported to 
client applications and end users. A fundamental ob- 
servation behind this work is that there is a continuum 
between strong and optimistic consistency that is seman- 
tically meaningful for a broad range of network services. 
This continuum is parameterized by the maximum dis- 
tance between a replica’s local data image and some fi- 
nal image “consistent” across all replicas after all writes 
have been applied everywhere. For strong consistency, 
this maximum distance is zero, while for optimistic con- 
sistency it is infinite. We explore the semantic space in 
between these two extremes. For a given workload, providing
a per-replica consistency bound allows the system
to determine an expected probability, for example, that 
a write operation will conflict with a concurrent write 
submitted to a remote replica, or that a read operation 
observes the results of writes that must later be rolled 
back. No such analysis can be performed for optimistic 
consistency systems because the maximum level of in- 
consistency is unbounded. 

The relationship between consistency, availability,
and performance is depicted in Figure 1(a). In moving
from strong consistency to optimistic consistency,
application performance and availability increase. This
benefit comes at the expense of an increasing probability
that individual accesses will return inconsistent
results, e.g., stale/dirty reads, or conflicting writes. In
our work, we allow applications to bound the maximum
probability/degree of inconsistent access in exchange
for increased performance and availability.
Figure 1(b) graphs different potential improvements in
application performance versus the probability of
inconsistent access, depending on workload/network
characteristics. Moving to the right in the figure corresponds
to increased performance, while moving up in the figure
corresponds to increased inconsistency. To achieve
increased performance, applications must tolerate a
corresponding increase in inconsistent accesses. The tradeoff
between performance and consistency depends upon a
number of factors, including application workload, such
as read/write ratios, probability of simultaneous writes,
etc., and network characteristics such as latency,
bandwidth, and error rates. At the point labeled “1” in the
consistency spectrum in Figure 1(b), a modest increase
in performance corresponds to a relatively large increase
in inconsistency for application classes corresponding to
the top curve, perhaps making the tradeoff unattractive
for these applications. Conversely, at point “2,” large
performance increases are available in exchange for a
relatively small increase in inconsistency for applications
represented by the bottom curve.

Figure 1: a) The spectrum between strong and optimistic
consistency as measured by a bound on the probability of
inconsistent access. b) The tradeoff between consistency,
availability, and performance depends upon application
and network characteristics.

Thus, the goals of this work are: i) to explore the is- 
sues associated with filling the semantic, performance, 
and availability gap between optimistic and strong con- 
sistency models, ii) to develop a set of metrics that allow 
a broad range of replicated services to conveniently and 
quantitatively express their consistency requirements, 
iii) to quantify the tradeoff between performance and 
consistency for a number of sample applications, and 
iv) to show the benefits of dynamically adapting consis- 
tency bounds in response to current network, replica, and 
client-request characteristics. To this end, we present 
the design, implementation, and evaluation of the TACT 
toolkit. TACT is a middleware layer that accepts specifi- 
cations of application consistency requirements and me- 
diates read/write access to an underlying data store. If 
an operation does not violate pre-specified consistency 
requirements, it proceeds locally (without contacting remote
replicas). Otherwise, the operation blocks until
TACT is able to synchronize with one or more remote 
replicas (i.e., push or pull some subset of local/remote 
updates) as determined by system consistency require- 
ments. 

We propose three metrics, Numerical Error, Order 
Error, and Staleness, to bound consistency. Numerical 
error limits the total weight of writes that can be applied 
across all replicas before being propagated to a given 
replica. Order error limits the number of tentative writes 
(subject to reordering) that can be outstanding at any one
replica, and staleness places a real-time bound on the de- 
lay of write propagation among replicas. Algorithms are 
then designed to bound each metric: Numerical error is 
bounded using a push approach based solely on local 
information; a write commitment algorithm combined 
with compulsory write pull enforces the order error bound;
and staleness is maintained using a real-time vector. To
evaluate the effectiveness of our system, we implement 
and deploy across the wide area three applications with 
a broad range of dynamically changing consistency re- 
quirements using the TACT toolkit: an airline reserva- 
tion system, a distributed bulletin board service, and load 
distribution front ends to a Web server. Relative to strong 
consistency techniques, TACT improves the throughput 
of these applications by up to a factor of 10. Relative 
to weak consistency approaches, TACT provides strong 
semantic guarantees regarding the maximum inconsis- 
tency observed by individual read and write operations. 

The rest of this paper is organized as follows. Sec- 
tion 2 describes the three network services implemented 
in the TACT framework to motivate our system archi- 
tecture. Section 3 presents the system model and design 
we adopt for our target services. Next, Section 4 details
the TACT architecture and Section 5 evaluates the per-
formance of our three applications in the TACT frame-
work. Finally, Section 6 places our work in the context
of related work and Section 7 presents our conclusions.


2 Applications 
2.1 Airline Reservations 


Our first application is a simple replicated airline 
reservation system that is designed to be representative 
of replicated E-commerce services that accept inquiries 
(searches) and purchase orders on a catalog. In our im- 
plementation, each server maintains a full replica of the 
flight information database and accepts user reservations 
and inquiries about seat availability. Consistency in this 
application is measured by the percentage of requests 
that access inconsistent results. For example, in the face 
of divergent replica images, a user may observe an avail- 
able seat, when in fact the seat has been booked at an- 
other replica (false positive). Or a user may see a par- 
ticular seat is booked when in fact, it is available (false 
negative). Intuitively, the probability of such events is 
proportional to the distance between the local replica im- 
age and some consistent final image. 

One interesting aspect of this application is that its 
consistency requirements change dynamically based on 
client, network, and application characteristics. For in- 
stance, the system may wish to minimize the rate of 
inquiries/updates that observe inconsistent intermediate 
states for certain preferred clients. Requests from such 
clients may require a replica to update its consistency 
level (by synchronizing with other replicas) before pro- 
cessing the request or may be directed to a replica that 
maintains the requisite consistency by default. As an- 
other example, if network capacity (latency, bandwidth, 
error rate) among replicas is abundant, the absolute per- 
formance/availability savings may not be sufficient to 
outweigh the costs associated with weaker consistency 
models. Finally, the desired consistency level depends 
on individual application semantics. For airline reserva- 
tions, the cost of a transaction that must be rolled back is 
fairly small when a flight is empty (one can likely find an 
alternate seat on the same flight), but grows as the flight 
fills. 


2.2 Bulletin Board


The bulletin board application is a replicated mes- 
sage posting service modeled after more sophisticated 
services such as USENET. Messages are posted to indi- 
vidual replicas. Sets of updates are propagated among 
replicas, ensuring that all messages are eventually dis- 
tributed to all replicas. This application is intended to be 
representative of interactive applications that often allow
concurrent read/write access under the assumption that
conflicts are rare or can be resolved automatically. 

Desirable consistency requirements for the bulletin 
board example include maintaining causal and/or total 
order among messages posted at different replicas. With 
causal order, a reply to a message will never appear be- 
fore the original message at any replica. Total order en- 
sures that all messages appear in the same order at all 
replicas, allowing the service to assign globally unique 
identifiers to each message. Another interesting consis- 
tency requirement for interactive applications, including 
the bulletin board, is to guarantee that at any time t, no
more than k messages posted before t are missing from 
the local replica. 


2.3 QoS Load Distribution


The final application implemented in our framework 
is a load distribution mechanism that provides Quality 
of Service (QoS) guarantees to a set of preferred clients. 
In this scenario, front-ends (as in LARD [27]) accept re- 
quests on behalf of two classes of clients, standard and 
preferred. The front ends forward requests to back end 
servers with the goal of reserving some pre-determined 
portion of server capacity for preferred clients. Thus, 
front ends allow a maximum number of outstanding re- 
quests (assuming homogeneous requests) at the back end 
servers. To determine the maximum number of “stan- 
dard” requests that should be forwarded, each front end 
must communicate current access patterns to all other
front ends. 

One goal of designing such a system is to minimize 
the communication required to accurately distribute such 
load information among front ends. This QoS applica- 
tion is intended to be representative of services that in- 
dependently track the same logical data value at multiple 
sites, such as a distributed sensor array, a load balancing 
system, or an aggregation query. Such services are often 
able to tolerate some bounded inaccuracy in the under- 
lying values they track (e.g., average temperature, server 
load, or employee salary) in exchange for reduced com- 
munication overhead or power consumption. 


3 System Design 


In this section, we first describe the basic replica- 
tion system model we assume, and then elaborate on the 
model and metrics we provide to allow applications to 
continuously specify consistency level. 


3.1 System Model 


For simplicity, we refer to application data as a 
data store, though the data can actually be stored in a
database, file system, persistent object, etc. The data
store is replicated in full at multiple sites. Each replica 
accepts requests from users that can be made up of multi- 
ple primitive read/write operations. TACT mediates ap- 
plication read/write access to the data store. On a sin- 
gle replica, a read or write is isolated from other reads 
or writes during execution. Depending on the specified 
consistency requirements, a replica may need to contact 
other replicas before processing a particular request. 


Replicas exchange updates by propagating writes. 
This can take the form of gossip messages [22], anti- 
entropy sessions [13, 28], group communication [5], 
broadcast, etc. We chose anti-entropy exchange as our 
write propagation method because of its flexibility in 
operating under a variety of network scenarios. Each 
write bears an accept stamp composed of a logical clock 
time [23] and the identifier of the accepting replica. 
Replicas deterministically order all writes based on this 
accept stamp. As in Bayou [28, 34], updates are proce- 
dures that check for conflicts with the underlying data 
store before being applied in a tentative state. A write is 
tentative until a replica is able to determine the write’s 
final position in the serialization order, at which point it 
becomes committed through a write commitment algo- 
rithm (described below). 


Each replica maintains a logical time vector, similar 
to that employed in Bayou and in Golding’s work [13, 
28, 34]. Briefly, each entry in the vector corresponds 
to the latest updates seen from a particular replica. The 
coverage property ensures that a replica has seen all updates
(remote and local) up to the logical time corresponding
to the minimum value in its logical time vector.
This means the serialization positions of all writes with 
smaller logical time than that minimal value can be de- 
termined and thus those writes can be committed. Anti- 
entropy sessions update values in each replica’s logical 
time vector based on the logical times/replicas of the 
writes exchanged. Note that writes may have to be reordered
or rolled back as dictated by the serialization
order.
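
The ordering and commitment rules above can be illustrated with a short Python sketch. It is not taken from the TACT implementation: the names `Write`, `serialization_order`, and `split_by_coverage` are hypothetical, an accept stamp is assumed to be a (logical time, replica identifier) pair, and the sketch uses the strict comparison stated above (writes with logical time smaller than the minimum vector entry are committable).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    logical_time: int  # logical clock value in the accept stamp
    replica: str       # identifier of the accepting replica
    op: str            # description of the update, for display only


def serialization_order(writes):
    """Order writes deterministically by accept stamp (time, then replica id)."""
    return sorted(writes, key=lambda w: (w.logical_time, w.replica))


def split_by_coverage(writes, time_vector):
    """Split a log into (committable, tentative) using the coverage property.

    time_vector maps each replica id to the latest logical time seen from it.
    """
    covered = min(time_vector.values())
    ordered = serialization_order(writes)
    committable = [w for w in ordered if w.logical_time < covered]
    tentative = [w for w in ordered if w.logical_time >= covered]
    return committable, tentative


# Hypothetical log, listed in arrival order rather than serialization order.
log = [Write(9, "B", "y += 1"), Write(3, "B", "x += 1"), Write(3, "A", "x += 2")]
committable, tentative = split_by_coverage(log, {"A": 7, "B": 9})
print([w.op for w in committable])  # ['x += 2', 'x += 1']
print([w.op for w in tentative])    # ['y += 1']
```

In this sketch an anti-entropy session would raise entries of `time_vector`, which in turn moves writes from the tentative list to the committable list.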


While TACT’s implementation of anti-entropy is not 
particularly novel, a primary aspect of our work is de- 
termining when and with whom to perform anti-entropy 
in order to guarantee a minimum level of consistency. 
Replicas may propagate writes to other replicas at any 
time through voluntary anti-entropy. However, we are 
more concerned with write propagation required for
maintaining a desired level of consistency, called com- 
pulsory anti-entropy. Compulsory anti-entropy is neces- 
sary for the correctness of the system, while voluntary 
anti-entropy only affects performance.


3.2 A Continuous Consistency Model


In our consistency model, applications specify their 
application-specific consistency semantics using conits. 
A conit is a physical or logical unit of consistency, de- 
fined by the application. For example, in the airline 
reservation example, individual flights or blocks of seats 
on a flight may be defined as a conit. An interesting 
issue beyond the scope of this paper is setting the granu- 
larity of conits. The required per-conit accounting over- 
head (described below) argues for coarse conit granu- 
larity. Conversely, coarse-grained conits may introduce 
issues of false sharing as updates to one data item in a 
conit may reduce performance/availability for accesses 
to logically unrelated data items in the same conit. 

For each conit, we quantify consistency continuously 
along a three-dimensional vector: 


Consistency = 


(Numerical Error, Order Error, Staleness) 


Numerical Error bounds the discrepancy between the
value of the conit and its value in the “final image.”
For applications that maintain numerical records, the semantics
of this metric are straightforward. For other
applications, however, application-specific weights (de- 
faulting to one) can be assigned to individual writes. The 
weights are the relative importance of the writes, from 
the application’s point of view. Numerical error then 
becomes the total weighted unseen writes for a conit. 
Based on application semantics, two different kinds of 
numerical error, absolute numerical error and relative 
numerical error, can be defined. Order Error measures
the difference between the order in which updates are applied
to the local replica and their ordering in the eventual
“final image.” Staleness bounds the difference between
the current time and the acceptance time of the
oldest write on a conit not seen locally.

Figure 2 illustrates the definition of order error and 
numerical error in a simple example. Two replicas, A 
and B, accept updates on a conit containing two data 
items, x and y. The logical time vector for A is (24, 5). 
The coverage property implies that all writes in its log 
with logical time less than or equal to five are commit- 
ted (indicated by the shaded box), leaving three tenta- 
tive writes. Similarly, the logical time vector for B is 
(0, 17), meaning that both writes in its log are tentative. 
Order error bounds the maximum number of tentative
writes at a replica, i.e., the maximum number of writes
that may have to be reordered or rolled back because of
activity at other replicas. In general, a tighter bound on
order error implies a lower probability that a read will
observe an inconsistent intermediate state. In this example,
if A’s order error is bounded to three, A must invoke
the write commitment algorithm (performing compulsory
anti-entropy to pull any necessary updates from B
to reduce its number of tentative writes) before accepting
any new writes.


[Figure 2 contents, partially recovered. Replica A: Logical Time Vector: (24, 5); Order Error: 3; Numerical Error: 1 (1). Replica B: log (5, B): x += 2 and (16, B): y += 1; Logical Time Vector: (0, 17); Order Error: 2; Numerical Error: 3 (5).]

Figure 2: Example scenario for bounding order error and numerical error with two replicas.


Figure 2 also depicts the role of numerical error. Nu- 
merical error is the weight of all updates applied to a 
conit at all replicas not seen by the local replica. In the 
example, the weight of a write is set to be the update 
amount to either x or y, so that a “major” update is more
important than a “minor” update. The replica A has not 
seen one update (with a weight of one) in this example, 
while B has not seen three updates (with a total weight 
of five). Note that order error can be relaxed or tightened 
using only local information. Bounding numerical er- 
ror, on the other hand, relies upon the cooperation of all 
replicas. Thus, dynamically changing numerical error 
bounds requires the execution of a consensus algorithm. 
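
The numbers in the Figure 2 discussion can be reproduced with a small Python sketch. B's log and both logical time vectors are taken from the text; the three writes accepted by A are assumed values, chosen only so that they total a weight of five as the text states. Following the Figure 2 description, a write counts as committed when its logical time is less than or equal to the minimum vector entry. The function names are hypothetical.

```python
def order_error(log, time_vector):
    """Number of tentative writes: those not covered by the minimum vector entry."""
    covered = min(time_vector)
    return sum(1 for (time, _replica, _weight) in log if time > covered)


def numerical_error(local_log, all_writes):
    """Return (count, total weight) of writes accepted anywhere but unseen locally."""
    unseen = [w for w in all_writes if w not in local_log]
    return len(unseen), sum(weight for (_time, _replica, weight) in unseen)


# Each write is (logical time, accepting replica, weight = update amount).
log_b = [(5, "B", 2), (16, "B", 1)]                              # from the text
log_a = [(5, "B", 2), (10, "A", 1), (14, "A", 2), (23, "A", 2)]  # A's writes assumed
all_writes = sorted(set(log_a) | set(log_b))

print(order_error(log_a, (24, 5)), numerical_error(log_a, all_writes))  # 3 (1, 1)
print(order_error(log_b, (0, 17)), numerical_error(log_b, all_writes))  # 2 (3, 5)
```

Note that `order_error` needs only the local log and vector, whereas `numerical_error` is computed here from a global view that no single replica has in practice, which is why bounding it requires cooperation among replicas.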


One benefit of our model is that conit consistency can 
be bounded on a per-replica basis. Instead of enforcing a 
system-wide uniform consistency level, each replica can 
have its own independent consistency level for a conit. 
A simple analysis can show that as a replica relaxes 
its consistency while other replicas’ consistency levels 
remain unchanged, the total communication amount of 
that replica is reduced. For relaxed numerical error, it 
means other replicas can push writes to that replica less 
frequently, resulting in fewer incoming messages. Out- 
going communication amount remains unchanged since 
that is determined by the consistency levels of other 
replicas. However, since numerical error is bounded using
a push approach, if the replica is too busy to handle
the outgoing communication, writes submitted to it
will be delayed. Similarly, if the replica relaxes order error
and staleness, incoming communication amount will
be decreased. Thus, one site may have poor network
connectivity and limited processing power, making more 
relaxed consistency bounds appropriate for that replica. 


Conversely, it may be cheap (from a performance and 
availability standpoint) to enforce stronger consistency 
at a replica with faster links and higher processing ca- 
pacity. One interesting aspect of this model is that it 
potentially allows the system to route client requests to 
replicas with appropriate consistency bounds on a per-request
basis. For instance, in the airline reservation system,
requests from “preferred” clients may be directed to
a replica that maintains higher consistency levels (reducing
the probability of an inconsistent access).


When all three metrics are bounded to zero, our con- 
tinuous consistency model reaches the strong consis- 
tency extreme of the spectrum, which is serializabil- 
ity [4] and external consistency [1, 12]. If no bounds are 
set for any of the metrics, there will be no consistency 
guarantees, similar to optimistic consistency systems. In 
moving from strong to optimistic consistency, applica- 
tions bound the maximum logical “distance” between 
the local replica image and the (unknown) consistent im- 
age that contains all writes in serial order. This distance 
corresponds directly to the percentage chance that a read 
will observe inconsistent results or that a write will in- 
troduce a conflict. In the next section, we will demon- 
strate how our three applications employ these metrics 
to capture their consistency requirements. Based on our 
experience with TACT, we believe that the above metrics 
allow a broad range of applications to conveniently express
their consistency requirements. Of course, the exact
set of metrics is orthogonal to our goal of exporting a
flexible, continuous, and dynamically tunable spectrum 
of consistency models to replicated services.


Application            Consistency Semantics                      Conit Definition           Weight Definition            Metrics Capturing the Semantics
Bulletin Board         A. Message Ordering                        A Newsgroup                (Subjective) Importance of   A. Order Error
                       B. Unseen Messages                                                    a News Message               B. Absolute Numerical Error
                       C. Message Delay                                                                                   C. Staleness
Airline Reservation    A. Reservation Conflict Rate R             Seats on a Flight          Reservation: 1               A. Relative Numerical Error
                                                                                                                             $R_{max} = 1 - 1/(1+\gamma)$
                                                                                                                             $R_{avg} = (1 - 1/(1+\gamma))/2$
                       B. Inconsistent Query Results                                                                      B. Order Error and Staleness
QoS Load Distribution  A. Accuracy of Resource Consumption Info   Resource Consumption Info  Request Forward: 1           A. Relative Numerical Error
                                                                                             Request Return: -1

Table 1: Expressing high-level application-specific consistency semantics using the TACT continuous consistency model.



3.3 Expressing Application-specific Consistency Semantics through Conits


One important criterion for the evaluation of any consistency
model is whether it captures the semantic requirements
of a broad range of applications. Thus, in
this section we describe the ways different consistency 
levels affect the semantics of the three representative ap- 
plications described in Section 2 and explain how these 
semantics are captured by our model. Table 1 sum- 
marizes these application-specific consistency semantics 
and their expression using TACT, as detailed in the dis- 
cussion below. 

For the distributed bulletin board, one consistency re- 
quirement is the ordering of messages, which is captured 
by the order error metric. More specifically, for this ap- 
plication order error is the number of messages that may 
appear out of order at any replica. However, it is possible 
that such a bound is overly restrictive for unrelated mes- 
sages, e.g., for two messages posted to different news- 
groups. In this case, a conit can be defined for each 
newsgroup to more precisely specify ordering require- 
ments. Another possible consistency requirement is the 
maximum number of remotely posted messages that are 
unseen by the local replica at a particular time. Our nu- 
merical error metric serves to express this type of seman- 
tics. Our model allows application-specific weights to be 
assigned to each write, allowing users to (subjectively) 
force the propagation of certain writes. A third consis- 
tency requirement for this application is message delay, 
that is, the delay between the time a message is posted 
and the time it is seen by all replicas. This requirement 
can be translated to staleness in a straightforward manner.

Moving to the airline reservation example, one im- 
portant manifestation of system consistency is the per- 
centage of conflicting reservations. An interesting as- 
pect of this application is TACT’s ability to limit reser- 
vation conflict rate by bounding relative numerical error 
based on the application’s estimate of available seats. To
simplify the discussion, we assume single seat reservations
(though our model and implementation are more
general) and define a conit over all seats on a flight
with each reservation carrying a numerical weight of -1.
Initially, the value of the conit is the total number of
seats on the flight. As reservations come in, the value
of the conit is the number of available seats in each
replica’s data store. Suppose reservations are randomly
distributed among all available seats. For a reservation
accepted by one replica, the probability that it conflicts
with another remote (unseen) reservation is $U/V$, where
$U$ is the number of unseen reservations, and $V$ is the
number of available seats as seen by the local replica.
Suppose $V_{final}$ is the accurate count of available seats,
such that $V_{final} = V - U$. Thus, the rate of conflicting
reservations, $R$, equals $1 - V_{final}/V$. If $\gamma$ bounds the
maximum relative numerical error of the conit then, by
definition, we have $-\gamma \le 1 - V/V_{final}$, which implies
$V_{final} \ge 1/(1+\gamma) \times V$. Thus, the upper bound on $R$ is
$R_{max} = 1 - V_{final}/V = 1 - 1/(1+\gamma)$. Since in this example
$V_{final}$ is always smaller than or equal to $V$, the average
value of $V_{final}$ should then be $(1 + 1/(1+\gamma))/2 \times V$.
Thus, the average rate of conflicting reservations, $R_{avg}$,
equals $1 - V_{final}/V = 1 - (1 + 1/(1+\gamma))/2 =
(1 - 1/(1+\gamma))/2$. In Section 5, we present experimental
results to verify this analysis. Non-random reservation
behavior will result in a higher conflict rate than the
above formula predicts. However, applications can still reduce
the expected/maximum conflict rate by specifying multiple
conits per flight, e.g., multiple conits for first class
versus coach seats or aisle versus window seats.
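
As a quick numerical check of the two bounds derived above, the following Python sketch evaluates them for a few values of the relative error bound. The function names are hypothetical, and `gamma_for_max_rate` is simply the algebraic inverse of the upper-bound formula, added here for illustration rather than taken from the analysis above.

```python
def max_conflict_rate(gamma):
    """Upper bound on the conflict rate: R_max = 1 - 1/(1 + gamma)."""
    return 1 - 1 / (1 + gamma)


def avg_conflict_rate(gamma):
    """Average conflict rate: R_avg = (1 - 1/(1 + gamma)) / 2."""
    return max_conflict_rate(gamma) / 2


def gamma_for_max_rate(rate):
    """Relative error bound that yields a given R_max (requires 0 <= rate < 1)."""
    return rate / (1 - rate)


for gamma in (0.0, 0.1, 0.5, 1.0):
    print(gamma, round(max_conflict_rate(gamma), 4), round(avg_conflict_rate(gamma), 4))
```

For example, a relative error bound of 1 permits at most half of the reservations to conflict, and a quarter on average under the random-reservation assumption.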


Other consistency semantics for the airline reservation 
example can be expressed using order error or staleness. 
For example, the system may wish to limit the percent- 
age of queries that access an inconsistent image, i.e., see 
a multi-seat reservation that must later be rolled back be- 
cause of a conflicting single-seat reservation at another 
replica. Such consistency semantics can be enforced by
properly bounding order error (an analysis
is omitted for brevity).

In our third application, QoS load distribution, front 
ends estimate the total resource consumption for stan- 
dard clients as the total number of outstanding standard 
requests on the back ends. This value also serves as the 
definition of a conit for this application. Front ends in- 
crease this value by 1 upon forwarding a request from 
a standard client and decrease it by 1 when the request 
returns. If this value exceeds a pre-determined resource 
consumption limit, front ends will not forward new stan- 
dard client requests until resource consumption drops 
below this limit. The relative numerical error of each 
front end’s estimate of resource consumption captures 
this application’s consistency semantics — each front 
end is guaranteed that its estimate of resource consump- 
tion is accurate within a fixed bound. Note that this load 
balancing application is not concerned with order error 
(writes are interchangeable) or staleness (no need to syn- 
chronize if the mix of requests does not change). 
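
The accounting performed by a front end can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the TACT code: the class and method names are hypothetical, a request is forwarded only while the local estimate is below the limit, and writes learned from other front ends are applied through the same `apply_write` call.

```python
class FrontEnd:
    """One front end's view of the conit: outstanding standard requests."""

    def __init__(self, limit):
        self.limit = limit   # pre-determined resource consumption limit
        self.estimate = 0    # local estimate of outstanding standard requests

    def apply_write(self, weight):
        """Apply a write, either local or learned from another front end."""
        self.estimate += weight

    def try_forward_standard(self):
        """Forward a standard request only while the estimate is below the limit."""
        if self.estimate >= self.limit:
            return False
        self.apply_write(+1)  # forwarding a request is a write of weight +1
        return True

    def on_request_return(self):
        self.apply_write(-1)  # a returning request is a write of weight -1


fe = FrontEnd(limit=3)
print([fe.try_forward_standard() for _ in range(3)])  # [True, True, True]
fe.apply_write(+1)                # a forward reported by another front end
print(fe.try_forward_standard())  # False
fe.on_request_return()
fe.on_request_return()
print(fe.try_forward_standard())  # True
```

The relative numerical error bound determines how far `estimate` may drift from the true total before other front ends must propagate their writes.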


3.4 Bounding Consistency Metrics 


Given the assumed system model, we now describe in
turn our algorithms for bounding numerical error, order
error, and staleness. Note that the details and correctness
proofs for our numerical error algorithms are available
separately [38].

The first algorithm, Split-Weight AE, employs a
“push” approach to bound absolute numerical error. It
“allocates” the allowed positive and negative error for a
server evenly to other servers. Each server$_i$ maintains
two local variables $x$ and $y$ for server$_j$, $j \ne i$. Intuitively,
the variable $x$ is the total weight of negatively-weighted
writes that server$_i$ has accepted but that have not been
seen by server$_j$. server$_i$ has only conservative knowledge
(called its view) of what writes server$_j$ has seen.
The variable $x$ is updated when server$_i$ accepts a new
write with a negative weight or when server$_i$’s view
is advanced. Similarly, the variable $y$ records the total
weight of positively-weighted writes. Suppose the absolute
error bound on server$_j$ is $\alpha_j$. In other words, we
want to ensure that $|V_{final} - V_j| \le \alpha_j$, where $V_{final}$
is the consistent value and $V_j$ is the value on server$_j$.
To achieve this, server$_i$ makes sure that at all times,
$x \ge -\alpha_j/(n - 1)$ and $y \le \alpha_j/(n - 1)$, where $n$ is
the total number of servers in the system. This may require
server$_i$ to push writes to server$_j$ before accepting
a new write.
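
The per-peer bookkeeping described above can be sketched in Python as follows. This is a simplified illustration rather than the algorithm as specified in [38]: the class name and method are hypothetical, a push is assumed to deliver every unseen write (including the new one) and to be acknowledged immediately, and view advancement through other channels is not modeled.

```python
class SplitWeightSender:
    """State kept by server_i for one conit (hypothetical sketch)."""

    def __init__(self, bounds):
        # bounds maps each peer j to its absolute error bound alpha_j.
        self.n = len(bounds) + 1           # all servers, including this one
        self.bounds = dict(bounds)
        self.x = {j: 0.0 for j in bounds}  # negative weight not yet seen by j
        self.y = {j: 0.0 for j in bounds}  # positive weight not yet seen by j

    def accept(self, weight):
        """Accept a local write; return the peers that need a push first."""
        must_push = []
        for j, alpha in self.bounds.items():
            share = alpha / (self.n - 1)   # this server's share of alpha_j
            x = self.x[j] + min(weight, 0)
            y = self.y[j] + max(weight, 0)
            if x < -share or y > share:
                must_push.append(j)
                x, y = 0.0, 0.0            # assumed: j has now seen everything
            self.x[j], self.y[j] = x, y
        return must_push


sender = SplitWeightSender({"B": 4.0, "C": 2.0})  # shares: 2.0 for B, 1.0 for C
print(sender.accept(1))   # []
print(sender.accept(1))   # ['C']
print(sender.accept(-3))  # ['B', 'C']
```

Because `x` and `y` are tracked separately, a negative write never offsets an earlier positive one, which is the source of the pessimism discussed next.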

Split-Weight AE is pessimistic in the sense that
server$_i$ may propagate writes to server$_j$ when not
actually necessary. For example, the algorithm does
not consider the case where negative weights and positive
weights may offset each other. We developed another
algorithm, Compound-Weight AE, which is optimal, to address
this limitation at the cost of increased space overhead.
However, simulations indicate that potential performance
improvements do not justify the additional
computational complexity and space overhead [38].

A third algorithm, Inductive RE, provides an efficient
mechanism for bounding the relative error in numerical
records. The algorithm transforms relative error
into absolute error. Suppose the relative error
bound for server$_j$ is $\gamma_j$, that is, we want to ensure
$|1 - V_j/V_{final}| \le \gamma_j$, equivalent to $|V_{final} - V_j| \le
\gamma_j \times V_{final}$. A naive transforming approach would use
$\gamma_j \times V_{final}$ as the corresponding absolute error bound,
requiring a consensus algorithm to be run to determine a
new absolute error bound each time $V_{final}$ changes.

Our approach avoids this cost by conservatively relying
upon local information as follows. We observe
that the current value $V_i$ on any server$_i$ was properly
bounded before the invocation of the algorithm and is
an approximation of $V_{final}$. So server$_i$ may use $V_i$ as
an approximate norm to bound relative error for other
servers. More specifically, for server$_i$, we know that
$V_{final} - V_i \ge -\gamma_i \times V_{final}$, where $\gamma_i$ is the relative error
bound for server$_i$, which transforms to $V_{final} \ge
V_i/(1 + \gamma_i)$. Using this information to substitute for
$V_{final}$ on the right-hand side of the inequality in the last
paragraph produces:


$$|V_{final} - V_j| \le \gamma_j \times \frac{V_i}{1 + \gamma_i}$$


Thus, to bound relative error, server$_i$ only needs to
recursively apply Split-Weight AE, using $\gamma_j \times V_i/(1 + \gamma_i)$
as $\alpha_j$. Note that while this approach greatly increases
performance by eliminating the need to run a
consensus algorithm among replicas, it uses local information
($V_i/(1 + \gamma_i)$) to approximate potentially unknown
global information ($V_{final}$) in bounding relative
error. Thus it behaves conservatively (bounding values
more than strictly necessary) when relative error is high
as will be shown in our evaluation of these algorithms in
Section 5.
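
A worked example may make the transformation concrete. The Python sketch below computes the absolute bound that server$_i$ would hand to Split-Weight AE and compares it with the naive bound for one consistent choice of $V_{final}$. The function name and all numeric values are hypothetical and chosen only for illustration.

```python
def inductive_absolute_bound(gamma_j, v_i, gamma_i):
    """Absolute bound alpha_j = gamma_j * V_i / (1 + gamma_i), from local data only."""
    return gamma_j * v_i / (1 + gamma_i)


v_i, gamma_i = 100.0, 0.25   # server_i's current value and its own relative bound
gamma_j = 0.1                # relative bound requested for server_j
v_final = 90.0               # unknown in practice; satisfies |1 - V_i/V_final| <= gamma_i

alpha_j = inductive_absolute_bound(gamma_j, v_i, gamma_i)
naive = gamma_j * v_final    # would require global knowledge of V_final

print(alpha_j)               # 8.0
print(alpha_j <= naive)      # True: the local bound is tighter than necessary
```

The gap between the two bounds grows with `gamma_i`, which matches the observation above that the approach is most conservative when relative error is high.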

To bound order error on a per-conit basis, a replica 
first checks the number of tentative writes on a conit 
in its write log. If this number exceeds the order error 
limit, the replica invokes a write commitment algorithm 
to reduce the number of tentative writes in its write log. 
This algorithm operates as follows. The replica pulls 
writes from other replicas by performing compulsory 
anti-entropy sessions to advance its logical time vector, 
allowing it to commit some set of its tentative writes. 
In doing so, the replica ensures that it remains within a 
specified order error bound before accepting new tenta- 
tive writes. 

To bound the staleness of a replica, each server maintains
a real time vector. This vector is similar to the logical
time vector, except that real time instead of logical
time is used. A similar coverage property is preserved
between the writes a server has seen and the real time
vector. If A’s real time vector entry corresponding to
B is $t$, then A has seen all writes accepted by B before
real time $t$. To bound staleness within $l$, a server checks
whether $\text{current time} - t < l$ holds for each entry in
the real time vector.¹ If the inequality does not hold
for some entries, the server performs compulsory anti-entropy
sessions with the corresponding servers, pulling
writes from them, and advances the real time vector.
This pull approach may appear to be less efficient than a 
push approach because of unnecessary polling when no 
updates are available. However, a push approach can- 
not bound staleness if there is no upper limit on network 
delay or processing time. 
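
The staleness check reduces to a comparison per vector entry, as the following Python sketch shows. The function name is hypothetical, the vector is assumed to hold entries for remote servers only, and times are plain numbers standing in for loosely synchronized clock readings.

```python
def stale_servers(real_time_vector, now, bound):
    """Return the servers whose entry violates (now - t < bound), sorted by id."""
    return sorted(s for s, t in real_time_vector.items() if not (now - t < bound))


vector = {"B": 95.0, "C": 99.5}  # latest real time covered for each remote server
print(stale_servers(vector, now=100.0, bound=2.0))  # ['B']
```

A replica would perform compulsory anti-entropy with each returned server and then advance the corresponding entries before serving the request.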


4 System Architecture 


The current prototype of TACT is implemented in 
Java 1.2 using RMI for communication (e.g., for ac- 
cepting read/write requests and for write propagation). 
TACT replicas are multi-threaded; thus, if one write incurs
compulsory write propagation, it will not block
writes on other conits. We implemented a simple custom
database for storing and retrieving data values, though
our design and implementation are compatible with a variety
of storage mechanisms.

Each TACT replica maintains a write log, and allows 
redo and undo on the write log. It is also responsi- 
ble for all anti-entropy sessions with remote replicas. 
The system supports parallel anti-entropy sessions with 
multiple replicas, which can improve performance sig- 
nificantly for compulsory anti-entropy across the wide 
area. For increased efficiency, we also implement a one- 
round anti-entropy push. With standard anti-entropy, be- 
fore a replica pushes writes to another replica, it first 
obtains the target replica’s logical time vector to deter- 
mine which writes to propagate. However, we found that 
this two-round protocol can add considerable overhead 
across the wide area, especially at stronger consistency 
levels (where the pushing replica has a fairly good notion 
of the writes seen by the target replica). Thus, we allow 
replicas to push writes using their local view as a hint, re- 
ducing two rounds of communication to one round at the 
cost of possibly propagating unnecessary writes. While 
the current implementation uses this one round protocol 
by default, dynamically switching between the variants 
based on the consistency level would be ideal. 
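
The difference between the two push variants can be sketched in Python as a choice of which time vector is used to select writes. This is an illustration only: `writes_to_push` is a hypothetical helper, accept stamps are assumed to be (logical time, replica) pairs, and the local view is assumed never to exceed what the target has actually seen.

```python
def writes_to_push(log, vector_for_target):
    """Select writes newer than what the target is believed to have seen.

    log: list of (logical time, accepting replica) accept stamps.
    vector_for_target: replica id -> latest logical time the target has seen.
    """
    return [(t, r) for (t, r) in log if t > vector_for_target.get(r, 0)]


log = [(3, "A"), (4, "B"), (5, "A"), (8, "A")]
actual_vector = {"A": 5, "B": 4}  # what the target would report (two rounds)
local_view = {"A": 3, "B": 4}     # pusher's possibly outdated hint (one round)

print(writes_to_push(log, actual_vector))  # [(8, 'A')]
print(writes_to_push(log, local_view))     # [(5, 'A'), (8, 'A')]
```

With the outdated hint, the write stamped (5, A) is sent although the target already has it; this is the cost of saving one round of communication.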

TACT replicas also implement a consistency manager
responsible for bounding numerical error, order error,
and staleness. The variables needed by the Split-Weight
AE and Inductive RE algorithms are maintained in hash
tables to reduce space overhead and enable the system to
potentially scale to thousands of conits.

¹ We assume that server clocks are loosely synchronized.


In bounding numerical error, a replica may need to 
push a write to other replicas before the write can re- 
turn, e.g., if a write has a weight that is larger than an- 
other replica’s absolute error bound. There are two pos- 
sible approaches for addressing this. One approach is a 
one-round protocol where the local site applies the write, 
propagates it to the necessary remote replicas, awaits ac- 
knowledgments, and finally returns. This one-round pro- 
tocol is appropriate for applications where writes are in- 
terchangeable such as resource accounting/load balanc- 
ing. For other applications, such as the airline reserva- 
tion example, a reservation itself observes a consistency 
level (the probability it conflicts with another reserva- 
tion submitted elsewhere). In such a case, a stronger 
two-round protocol is required where the replica first 
acquires remote data locks, pushes the write to remote 
replicas, and then returns after receiving all acknowledg- 
ments. Such a two-round protocol ensures the numerical 
error observed by a write is within bound at the time the 
update is submitted. In our prototype, both protocols are 
implemented and the application is allowed to choose 
based on its own requirements. 
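
The difference between the two protocols can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the `Replica` class, the function names, the returned round counts, and the choice to release locks together with the push are all hypothetical and are not taken from the TACT prototype.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Replica:
    """Hypothetical in-memory stand-in for a replica (not TACT's API)."""
    name: str
    value: int = 0
    locked_by: Optional[str] = None
    log: List[int] = field(default_factory=list)

    def lock(self, owner: str) -> bool:
        # Grant the data lock only if nobody else holds it.
        if self.locked_by is None:
            self.locked_by = owner
            return True
        return False

    def unlock(self, owner: str) -> None:
        if self.locked_by == owner:
            self.locked_by = None

    def apply(self, weight: int) -> bool:
        # Apply a write and return an acknowledgment.
        self.value += weight
        self.log.append(weight)
        return True


def one_round_write(local: Replica, remotes: List[Replica], weight: int) -> int:
    """Apply locally, push to remotes, await acknowledgments, return."""
    local.apply(weight)
    acks = [r.apply(weight) for r in remotes]
    assert all(acks)
    return 1  # one round of remote communication


def two_round_write(local: Replica, remotes: List[Replica], weight: int) -> int:
    """Acquire remote data locks first, then push the write."""
    acquired = []
    # Assumption: locks are taken in a fixed (name) order to avoid deadlock.
    for r in sorted(remotes, key=lambda rep: rep.name):
        if not r.lock(local.name):
            for held in acquired:
                held.unlock(local.name)
            return 0  # nothing applied; the caller may retry
        acquired.append(r)
    local.apply(weight)
    acks = [r.apply(weight) for r in remotes]
    # Assumption: locks are released as part of the push round.
    for r in acquired:
        r.unlock(local.name)
    assert all(acks)
    return 2  # lock round plus push round
```

In this sketch, the two-round variant guarantees that no other replica can apply a conflicting write between the lock round and the push, which is the property the airline reservation example needs; the one-round variant skips that guarantee in exchange for one fewer round.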


5 Experience and Evaluation 


Given the description of our system architecture, we 
now discuss our experience in building the three appli- 
cations described in Section 2 using the TACT infras- 
tructure. We define conits and weights in these applica- 
tions according to the analysis in Section 3.3. The exper- 
iments below focus on TACT’s ability to bound numeri- 
cal error and order error. While implemented in our pro- 
totype, we do not present experiments addressing stale- 
ness for brevity and because bounding staleness is well- 
studied, e.g., in the context of Web proxy caching [10]. 


5.1 Bulletin Board 


For our evaluation of the bulletin board application, 
we deployed replicas at three sites across the wide area: 
Duke University (733 MHz Pentium III/Solaris 2.8),
University of Utah (350 MHz Pentium II/FreeBSD 3.4),
and University of California, Berkeley (167 MHz Ultra
I/Solaris 2.7). All data is collected on otherwise
unloaded systems. Each submitted message is assigned a
numerical weight of one (all messages are considered 
equally important). 

Figure 3: Average latency for posting messages to a
replicated bulletin board as a function of consistency
guarantees.

Figure 4: Breakdown of the overhead of posting a message
under a number of scenarios.

We conduct a number of experiments to explore the
behavior of the system at different points in the
consistency spectrum. Figure 3 plots the average latency
for a client at Duke to post 200 messages as a function 
of the numerical error bound on the x-axis. For com- 
parison, we also plot the average latency for a conven- 
tional implementation using a two-phase update proto- 
col. For each write, this protocol first acquires necessary 
remote data locks, then propagates the update to all re- 
mote replicas. The figure shows how applications are 
able to continuously trade performance for consistency 
using TACT. As the numerical error bound increases, av- 
erage latency decreases. Increasing allowable order er- 
ror similarly produces a corresponding decrease in av- 
erage latency. Relative to the conventional implementa- 
tion, allowing each replica to have up to 20 unseen mes- 
sages and leaving order error unbounded reduces aver- 
age latency by a factor of 10. 

One interesting aspect of Figure 3 is that TACT performs
worse than the standard two-phase update protocol
at the strong consistency end of the spectrum. To
investigate this overhead, Figure 4 summarizes the
performance overheads associated with message posts
using TACT at four points in the consistency spectrum
(varying order error with numerical error set to zero) in 
comparison to the conventional two-phase update proto- 
col. All five configurations incur approximately 130ms 
to sequentially (required to avoid deadlock) acquire data 
locks from two remote replicas and 80ms to push writes 
to these replicas in parallel. Since the cost of remote pro- 
cessing is negligible, this overhead comes largely from 
wide-area latency. Compared to the conventional imple- 
mentation, TACT with zero numerical error and zero or- 
der error (i.e., same consistency level) incurs about 83% 
more overhead. This additional overhead stems from the 
additional 140ms to bound order error. This is an in- 
teresting side effect associated with our implementation. 
Our design decomposes consistency into two orthogo- 
nal components (numerical error and order error) that 
are bounded using two separate operations, doubling the 
number of wide-area round trip times. When order error 
and numerical error are both zero, TACT should com- 
bine the push and pull of write operations into a single 
step as a performance optimization, as is logically done 
by the conventional implementation. This idea is
especially applicable if we use the recently proposed
quorum approach [16, 17] to commit writes. A preliminary
implementation of this optimization shows that TACT’s 
overhead (at strong consistency) drops from 367ms to 
about 217ms, within 8% of the conventional approach. 
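
As a rough sanity check, the figures quoted above can be compared against each other. The sketch below only restates numbers from the text; treating "about 83%" and "within 8%" as exact ratios is an assumption made for the arithmetic, so the results are approximate.

```python
# Figures quoted in the text, all approximate and in milliseconds.
LOCK_MS = 130            # sequential acquisition of locks from two remote replicas
PUSH_MS = 80             # parallel push of writes to those replicas
ORDER_MS = 140           # additional step TACT uses to bound order error
TACT_TOTAL_MS = 367      # TACT at zero numerical error and zero order error
TACT_OPTIMIZED_MS = 217  # TACT with the combined push/pull optimization

# Sum of the components shared by all five configurations.
shared_components = LOCK_MS + PUSH_MS

# Shared components plus TACT's separate order-error step.
tact_components = shared_components + ORDER_MS

# Conventional overhead implied by "about 83% more overhead".
baseline_from_83_percent = TACT_TOTAL_MS / 1.83

# Conventional overhead implied by reading "within 8%" as exactly 8%.
baseline_from_8_percent = TACT_OPTIMIZED_MS / 1.08

print(shared_components, tact_components)
print(round(baseline_from_83_percent, 1), round(baseline_from_8_percent, 1))
```

Both ratios imply a conventional overhead of roughly 200ms, in line with the approximately 210ms obtained by summing the lock and push components.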


5.2 Airline Reservation System 


We now evaluate our implementation of the simple 
airline reservation system using TACT. Once again, we 
deployed three reservation replicas at Duke, Utah and 
Berkeley. We considered reservation requests for a sin- 
gle flight with 400 seats. Each client reservation request 
is for a randomly chosen seat on the flight. If a tentative 
reservation conflicts with a request at another replica, a 
merge procedure attempts to reserve a second seat on the 
same flight. If no seats are available, the reservation is 
discarded. A conit is defined over all seats on the flight, 
with an initial value of 400. Each reservation carries a 
numerical weight of -1. 

Figure 5: Percentage of conflicting reservations as a
function of the bound on numerical error.

Figure 6: Average latency for making a reservation as a
function of consistency guarantees.

In Section 3.3, we derived a relationship between the
reservation conflict rate $R$ and the relative error bound
$\gamma$: $R_{max} = 1 - 1/(1+\gamma)$ and
$R_{avg} = (1 - 1/(1+\gamma))/2$.
We conduct the following experiment to verify that an
application can limit the reservation conflict rate by
simply bounding the relative numerical error. Figure 5 plots
the measured conflicting reservation rate $R$, the computed
upper bound $R_{max}$, and the computed average rate
$R_{avg}$ as a function of relative numerical error. Order
error and staleness are not bounded in these experiments.
The experiments are performed with two replicas on a
LAN at Duke, each attempting to make 250 (random)
reservations with the results averaged across four runs.
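
To make the two computed curves concrete, the sketch below evaluates the upper bound 1 - 1/(1 + gamma) and the average rate (half of that bound) for a given relative error bound gamma. The function name and the sample values are illustrative assumptions, not part of TACT.

```python
def conflict_rate_bounds(gamma):
    """Return (r_max, r_avg) for a relative error bound gamma >= 0.

    r_max = 1 - 1/(1 + gamma) is the computed upper bound on the
    conflict rate; r_avg is half of that bound.
    """
    if gamma < 0:
        raise ValueError("relative error bound must be non-negative")
    r_max = 1.0 - 1.0 / (1.0 + gamma)
    return r_max, r_max / 2.0


# Illustrative values only; these are not measurements from the paper.
for gamma in (0.0, 0.25, 1.0):
    r_max, r_avg = conflict_rate_bounds(gamma)
    print(f"gamma={gamma:.2f}  r_max={r_max:.3f}  r_avg={r_avg:.3f}")
```

A bound of zero forces a zero conflict rate, while the upper bound approaches one as the relative error bound grows without limit.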

The measured conflict rate roughly matches the com- 
puted average rate and is always below the computed 
upper bound, demonstrating that numerical error can be 
used to bound conflicting accesses as shown by our anal- 
ysis. Note that as the bound on relative error is relaxed, 
the discrepancy between the measured rate and the com- 
puted average rate gradually increases because of con- 
servativeness inherent in the design of our Inductive RE 
algorithm (i.e., at relaxed consistency, our algorithm per- 
forms more write propagation than necessary). As de- 
scribed in Section 3, this conservative behavior greatly 
improves performance by allowing each replica to bound 
relative error using only local information. 

Figure 7: Update throughput for airline reservations as a
function of consistency guarantees.

The latency and throughput measurements for airline
reservations, summarized in Figures 6 and 7, are similar
to those for the bulletin board application described above.
The latency experiments are run on the same wide-area
configuration as the bulletin board. The plotted latency 
is the average observed by a single Duke client mak- 
ing 400 reservations. For throughput, we run two client 
threads at each of the replica sites, with each thread
requesting 400/(2 × 3) ≈ 67 (random) seats in a tight
loop. We also plot the application’s performance using 
a two-phase update protocol, showing the same trends 
as the results for the bulletin board application. As con- 
sistency is gradually relaxed, TACT achieves increasing 
performance by reducing the amount of required wide- 
area communication. 


5.3 Quality of Service for Web Servers 


For our final application, we demonstrate how 
TACT’s numerical error bound can be used to accurately 
enforce quality of service (QoS) guarantees among Web 
servers distributed across the wide area. Recall that a 
number of front-end machines forward requests on be- 
half of both standard and preferred clients to back end 
servers. In our implementation, we use TACT to dy- 
namically trade communication overhead in exchange 
for accuracy in measuring total resources consumed by 
standard clients. The front ends estimate the standard 
client resource consumption as the total number of out- 
standing standard requests on the back ends. If this re- 
source consumption exceeds a pre-determined resource 
consumption limit, front ends will not forward new stan- 
dard client requests until resource consumption drops 
below this limit. For simplicity, all our experiments are 
run on a local-area network at Duke on seven 733 MHz
Pentium IIIs running Solaris 2.8. Three front ends (each
running on a separate machine) generate requests in a
round-robin fashion to three back end servers running
Apache 1.3.12.

Figure 8: The average latency seen by a preferred client
as a function of time.

Table 2: The tradeoff between TACT-enforced numerical
error and communication overhead.


For our experiments, the three front end machines 
generate an increasing number of requests from standard 
clients. As a whole, the system desires to bound the 
number of outstanding standard client requests to 150. 
A fourth machine, representing a preferred client, peri- 
odically polls a random back end to determine system 
latency. Each of the three front ends starts a new stan- 
dard client every two seconds which then continuously 
requests the same dynamically generated Web page re- 
quiring 10ms of computation time. If all front ends had 
global knowledge of system state, each front end would 
start a total of 50 standard clients. However, depending 
on the bound placed on numerical error, front ends may
in fact start more than this number (up to 130 in the ex- 
periment described below). For simplicity, no standard 
clients are torn down even if the system learns that too 
many (i.e., more than 150) are present in aggregate. Ide- 
ally, this aggregate number would oscillate around 150 
with the amplitude of the oscillation being determined 
by the relative numerical bound. 
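
The ideal case described above, in which every front end has global knowledge, can be sketched as a short simulation. The function name is hypothetical, and the sketch assumes that each front end starts its first client at the end of the first two-second interval.

```python
def simulate_global_knowledge(num_front_ends=3, limit=150, interval_s=2):
    """Start one client per front end per interval until the limit is hit.

    Returns (elapsed_seconds, clients_started_per_front_end). Every front
    end is assumed to see the exact aggregate count before each start.
    """
    started = [0] * num_front_ends
    elapsed = 0
    while sum(started) < limit:
        elapsed += interval_s
        for i in range(num_front_ends):
            # With perfect knowledge, a front end refuses to exceed the limit.
            if sum(started) < limit:
                started[i] += 1
    return elapsed, started


elapsed, started = simulate_global_knowledge()
print(elapsed, started)  # 100 [50, 50, 50]
```

With a nonzero numerical error bound, each front end would act on a possibly stale estimate of the aggregate instead of the exact sum, which is why more than 50 clients per front end may be started in the experiments.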

Figure 8 depicts latency observed by the preferred 
client as a function of elapsed time (corresponding to 
the total number of standard clients making requests). 
At time 260, each front end has tried to spawn up to 130 
standard clients. The curves show the average latency
observed by the preferred client for different bounds on
numerical error. For comparison purposes, we also show 
the latency (1745ms) of a preferred client when there are 
exactly 150 outstanding standard client requests. In the 
first curve, labeled “Relative Error=0,” the system main- 
tains strong consistency. Therefore, the front ends are 
able to enforce the resource limit strictly. The curve
corresponding to a relative error of 0 flattens at 100
seconds (when three front ends have created a total of 150
standard clients) with latency very close to the ideal of 
1745ms. As the bound on relative error is relaxed to 0.3, 
0.5, and 1, the resource consumption limit for standard 
clients is more loosely enforced. The curve “No_QoS” 
plots the latency where no resource policy is enforced. 
Similar to the airline reservation application, the
discrepancy between the relative error upper bound of 1 and the
“No_QoS” curve stems from the conservativeness of the
Inductive RE algorithm. 

Table 2 quantifies the tradeoff between numerical er- 
ror and communication overhead. Clearly, front ends 
can maintain near-perfect information about the load 
generated from other replicas at the cost of sending one 
message to all peers for each event that takes place. This 
is the case when zero numerical error is enforced by 
TACT: Each replica sends 50 messages to each of two 
remote replicas (for a total of 300) corresponding to the 
number of logical events that take place during the ex- 
periment. Once each front end starts 50 standard clients, 
strong consistency ensures that no further messages are 
necessary. Of course, such accuracy is typically not re- 
quired by this application. Table 2 shows that commu- 
nication overhead drops rapidly in exchange for some 
loss of accuracy. Note that this drop off will be more 
dramatic as the number of replicas is increased as a re- 
sult of the all-to-all communication required to maintain 
strong consistency. 
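
The message count for the zero-error case follows from simple all-to-all arithmetic, sketched below. The function name is hypothetical; the formula assumes every replica sends one message to every other replica for each of its local events, as the text describes for strong consistency.

```python
def strong_consistency_messages(num_replicas, events_per_replica):
    """Messages sent when each event is pushed to every peer."""
    peers = num_replicas - 1
    return num_replicas * peers * events_per_replica


# Three front ends, 50 standard clients started at each.
print(strong_consistency_messages(3, 50))  # 300

# The count grows quadratically with the number of replicas.
for n in (3, 6, 12):
    print(n, strong_consistency_messages(n, 50))
```

The quadratic growth in the number of replicas is what makes the savings from a relaxed error bound more pronounced in larger deployments.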


6 Related Work 


The tradeoff between consistency and
performance/availability is well understood [7, 8]. Many
systems have been built at the two extremes of the 
consistency spectrum. Traditional replicated trans- 
actional databases use strong consistency (one-copy 
serializability [4]) as a correctness criterion. At the 
other end of the spectrum are optimistic systems such as 
Bayou [28, 34], Ficus [14], Rumor [15] and Coda [18]. 
In these systems, higher availability/performance is 
explicitly favored over strong consistency. Besides 
Bayou, none of the above systems provide support for 
different consistency levels. Bayou provides session 
guarantees [9, 33] to ensure that clients switching from 
one replica to another view a self-consistent version of
the underlying database. However, session guarantees
do not provide any guarantees regarding the consistency 
level of a particular replica. 


In [37], we present a position paper describing an 
earlier iteration of our consistency model, using dif- 
ferent consistency metrics and concentrating on con- 
sistency/availability tradeoffs. A number of other ef- 
forts also attempt to numerically capture applications’ 
consistency requirements. These techniques can be 
vaguely categorized into two classes: Relaxing consis- 
tency among replicas to reduce required communica- 
tion (replica control) [3, 6, 19, 26, 32, 35] and relax- 
ing consistency for transactions on a single site to allow 
increased concurrency on that site (concurrency
control) [2, 20, 21, 29, 30, 36]. TACT is more closely related
to replica control techniques. However, previous con- 
sistency models for replica control typically exploit the 
consistency semantics of a particular application class, 
abstracting its consistency requirements along a single 
dimension. Most of the proposed consistency metrics 
can be expressed within our model by constraining a 
subset of numerical error, order error, and staleness. Kr- 
ishnakumar and Bernstein [19] propose the concept of 
an “N-ignorant” system, where a transaction runs in par- 
allel with at most N conflicting transactions. By set- 
ting absolute numerical error bound to N and by assign- 
ing unit weights to writes, TACT demonstrates behav- 
ior similar to an “N-ignorant” system. Timed consis- 
tency [35] and delta consistency [32] address the lack 
of timing in traditional consistency models such as se- 
quential consistency. These timed models can be read- 
ily expressed using our staleness metric. Quasi-copy 
caching [3] proposes four “coherency conditions,” de- 
lay condition, frequency condition, arithmetic condition 
and version condition appropriate for read-only caching. 
TACT, on the other hand, is designed for more gen- 
eral read/write replication. Two recent efforts [6, 26] 
use metrics related to numerical error and staleness to 
measure database freshness. However, these systems do 
not provide mechanisms to bound data consistency us- 
ing the proposed metrics. Relative to these efforts, our 
conit-based three-dimensional consistency model allows 
a wide range of services to dynamically express their 
consistency semantics based on application, network, 
and client-specific characteristics. 


Concurrency control techniques using relaxed consis- 
tency models [2, 20, 21, 29, 30, 36] are related to replica 
control and TACT, in that consistency also needs to be 
quantified there. However, enforcing user-defined con- 
sistency levels is inherently easier in concurrency con- 
trol than in replica control because in the former case 
most information needed to compute the amount of in- 
consistency is available on a single site. In other words, 
the consistency models do not need to consider the “final
image,” which might be unknown to all replicas. Since
all three of our metrics are related to the “final image,” none of
them can be expressed using relaxed consistency models 
for concurrency control. 

In fluid replication [24], clients are allowed to dynam- 
ically create service replicas to improve performance. 
Their study on when and where to create a service 
replica is complementary to our study on tunable con- 
sistency issues among replicas. Similar to Ladin’s sys- 
tem [22], fluid replication supports three consistency 
levels: last-writer, optimistic and pessimistic. Our work 
focuses on capturing the spectrum between optimistic 
and pessimistic consistency models. Varying the fre- 
quency of reconciliation in fluid replication allows ap- 
plications to adjust the “strength” of the last-writer and 
optimistic models. Bounding staleness in TACT has 
similar effects. However, as motivated earlier, staleness 
alone does not fully capture application-specific consis- 
tency requirements. 

Fox and Brewer [11] argue that strong consistency and 
one-copy availability cannot be achieved simultaneously 
in the presence of network partitions. In the context 
of the Inktomi search engine, they show how to trade 
harvest for yield. Harvest measures the fraction of the 
data reflected in the response, while yield is the prob- 
ability of completing a request. In TACT, we concen- 
trate on consistency among service replicas, but a similar 
“harvest” concept can also be defined using our consis- 
tency metrics. For example, bounding numerical error 
has similar effects as guaranteeing a particular harvest. 
Finally, Olston and Widom [25] address tunable perfor- 
mance/precision tradeoffs in the context of aggregation 
queries over numerical database records.


7 Conclusions and Future Work 


Traditionally, designers of replicated systems have 
been forced to choose between strong consistency, with 
its associated performance overheads, and optimistic 
consistency, with no guarantees regarding the probabil- 
ity of conflicting writes or stale reads. In this paper, 
we explore the space in between these two extremes. 
We present a continuous consistency model where ap- 
plication designers can bound the maximum distance 
between the local data image and some final consistent 
state. This space is parameterized by three metrics, Nu- 
merical Error, Order Error, and Staleness. We show 
how TACT, a middleware layer that enforces consistency 
bounds among replicas, allows applications to dynami- 
cally trade consistency for performance based on current 
service, network, and request characteristics. A
performance evaluation of three replicated applications (an
airline reservation system, a bulletin board, and a QoS Web
service) implemented using TACT demonstrates significant
semantic and performance benefits relative to
traditional approaches.

We are investigating a number of interesting questions 
posed by the TACT consistency model. We are currently 
working on both theoretical and practical issues asso- 
ciated with trading system consistency for availability. 
Theoretically, is there an upper bound on availability 
given a consistency level with particular numerical er- 
ror, order error and staleness? Practically, how close to 
this upper bound can the TACT prototype provide dy- 
namically tunable consistency and availability? Simi- 
larly, can TACT adaptively set application consistency 
levels in response to changing wide-area network
performance characteristics using application-specified
targets for minimum performance, availability, and
consistency?


Abstract


This paper presents a new persistent data manage- 
ment layer designed to simplify cluster-based Internet 
service construction. This self-managing layer, called 
a distributed data structure (DDS), presents a conven- 
tional single-site data structure interface to service au- 
thors, but partitions and replicates the data across a clus- 
ter. We have designed and implemented a distributed 
hash table DDS that has properties necessary for Inter- 
net services (incremental scaling of throughput and data 
capacity, fault tolerance and high availability, high con- 
currency, consistency, and durability). The hash table 
uses two-phase commits to present a coherent view of 
its data across all cluster nodes, allowing any node to 
service any task. We show that the distributed hash 
table simplifies Internet service construction by decou- 
pling service-specific logic from the complexities of per- 
sistent, consistent state management, and by allowing 
services to inherit the necessary service properties from 
the DDS rather than having to implement the proper- 
ties themselves. We have scaled the hash table to a 128 
node cluster, 1 terabyte of storage, and an in-core read 
throughput of 61,432 operations/s and write throughput 
of 18,582 operations/s. 


1 Introduction 


Internet services are successfully bringing infras- 
tructural computing to the masses. Millions of peo- 
ple depend on Internet services for applications like 
searching, instant messaging, directories, and maps, 
and also to safeguard and provide access to their per- 
sonal data (such as email and calendar entries). As 
a direct consequence of this increasing user
dependence, today’s Internet services must possess many
of the same properties as the telephony and power 
infrastructures. These service properties include the 
ability to scale to large, rapidly growing user popula- 
tions, high availability in the face of partial failures, 
strictly maintaining the consistency of users’ data, 
and operational manageability. 

It is challenging for a service to achieve all of 
these properties, especially when it must manage 
large amounts of persistent state, as this state must
remain available and consistent even if individual
disks, processes, or processors crash. Unfortunately, 
the consequences of failing to achieve the proper- 
ties are harsh, including lost data, angry users, and 
perhaps financial liability. Even worse, there appear 
to be few reusable Internet service construction plat- 
forms (or data management platforms) that success- 
fully provide all of the properties. 

Many projects and products propose using soft- 
ware platforms on clusters to address these chal- 
lenges and to simplify Internet service construction 
[1, 2, 6, 15]. These platforms typically rely on com- 
mercial databases or distributed file systems for per- 
sistent data management, or they do not address 
data management at all, forcing service authors to 
implement their own service-specific data manage- 
ment layer. We argue that databases and file sys- 
tems have not been designed with Internet service 
workloads, the service properties, and cluster envi- 
ronments specifically in mind, and as a result, they 
fail to provide the right scaling, consistency, or avail- 
ability guarantees that services require. 

In this paper, we bring scalable, available, and 
consistent data management capabilities to cluster 
platforms by designing and implementing a reusable, 
cluster-based storage layer, called a distributed data 
structure (DDS), specifically designed for the needs 
of Internet services. A DDS presents a conven- 
tional single site in-memory data structure interface 
to applications, and durably manages the data be- 
hind this interface by distributing and replicating 
it across the cluster. Services inherit the aforemen- 
tioned service properties by using a DDS to store 
and manage all persistent service state, shielding 
service authors from the complexities of scalable, 
available, persistent data storage, thus simplifying 
the process of implementing new Internet services. 

We believe that given a small set of DDS types 
(such as a hash table, a tree, and an administra- 
tive log), authors will be able to build a large class 
of interesting and sophisticated servers. This pa- 
per describes the design, architecture, and imple- 
mentation of one such distributed data structure (a 
distributed hash table built in Java). We evaluate
its performance, scalability and availability, and its
ability to simplify service construction.


1.1 Clusters of Workstations 


In [15], it is argued that clusters of workstations 
(commodity PC’s with a high-performance network) 
are a natural platform for Internet services. Each 
cluster node is an independent failure boundary, 
which means that replicating computation and data 
can provide fault tolerance. A cluster permits in- 
cremental scalability: if a service runs out of ca- 
pacity, a good software architecture allows nodes to 
be added to the cluster, linearly increasing the ser- 
vice’s capacity. A cluster has natural parallelism: 
if appropriately balanced, all CPUs, disks, and net- 
work links can be used simultaneously, increasing 
the throughput of the service as the cluster grows. 
Clusters have high-throughput, low-latency redundant
system area networks (SANs) that can achieve
1 Gb/s throughput with 10 to 100 µs latency.


1.2 Internet Service Workloads


Popular Internet services process hundreds of 
millions of tasks per day. A task is usually “small”, 
causing a small amount of data to be transferred 
and computation to be performed. For example, 
according to press releases, Yahoo (http://www.yahoo.com)
serves 625 million page views per day.
Randomly sampled pages from the Yahoo directory
average 7 KB of HTML data and 10 KB of image
data. Similarly, AOL’s web proxy cache
(http://www.aol.com) handles 5.2 billion web requests per
day, with an average response size of 5.5 KB. Services
often take hundreds of milliseconds to process
a given task, and their responses can take many sec- 
onds to flow back to clients over what are predom- 
inantly low bandwidth last-hop network links [19]. 
Given this high task throughput and non-negligible 
latency, a service may handle thousands of tasks si- 
multaneously. Human users are typically the ulti- 
mate source of tasks; because users usually generate 
a small number of concurrent tasks (e.g., 4 parallel 
HTTP GET requests are typically spawned when 
a user requests a web page), the large set of tasks 
being handled by a service are largely independent. 
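
The step from daily task counts to "thousands of tasks simultaneously" can be illustrated with a back-of-the-envelope calculation. The sketch below uses Little's law (tasks in flight = arrival rate × time in system), which is a technique substituted here for illustration and is not named in the text. The 0.5 s latency is a hypothetical value within the "hundreds of milliseconds" range, and arrivals are assumed to be spread evenly over the day.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_rate_per_second(tasks_per_day):
    """Average arrival rate, assuming tasks are spread evenly over a day."""
    return tasks_per_day / SECONDS_PER_DAY


def concurrent_tasks(rate_per_second, latency_seconds):
    """Little's law: average number of tasks in flight."""
    return rate_per_second * latency_seconds


# 625 million page views per day, as quoted for Yahoo.
yahoo_rate = average_rate_per_second(625_000_000)

# Hypothetical 0.5 s per task (processing plus response delivery).
in_flight = concurrent_tasks(yahoo_rate, 0.5)

print(round(yahoo_rate), round(in_flight))
```

Under these assumptions the average rate is on the order of seven thousand tasks per second, so several thousand tasks are in flight at any moment; peak load and multi-second response delivery would push the figure higher.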


2 Distributed Data Structures 


Figure 1: High-level view of a DDS: a DDS is a
self-managing, cluster-based data repository. All service
instances (S) in the cluster see the same consistent
image of the DDS; as a result, any WAN client (C) can
communicate with any service instance.

A distributed data structure (DDS) is a self-managing
storage layer designed to run on a cluster
of workstations [2] and to handle Internet service
workloads. A DDS has all of the previously
mentioned service properties: high throughput, high
concurrency, availability, incremental scalability,
and strict consistency of its data. Service authors
see the interface to a DDS as a conventional data
structure, such as a hash table, a tree, or a log.
Behind this interface, the DDS platform hides all 
of the mechanisms used to access, partition, repli- 
cate, scale, and recover data. Because these com- 
plex mechanisms are hidden behind the simple DDS 
interface, authors only need to worry about service- 
specific logic when implementing a new service. The 
difficult issues of managing persistent state are han- 
dled by the DDS platform. 

Figure 1 shows a high-level illustration of a 
DDS. All cluster nodes have access to the DDS and 
see the same consistent image of the DDS. As long 
as services keep all persistent state in the DDS, any 
service instance in the cluster can handle requests 
from any client, although we expect clients will have 
affinity to particular service instances to allow ses- 
sion state to accumulate. 

The idea of having a storage layer to manage 
durable state is not new, of course; databases and 
file systems have done this for many decades. The 
novel aspects of a DDS are the level of abstraction 
that it presents to service authors, the consistency 
model it supports, the access behavior (concurrency 
and throughput demands) that it presupposes, and 
its many design and implementation choices that are 
made based on its expected runtime environment 
and the types of failures that it should withstand. 
A direct comparison between databases, distributed 
file systems, and DDS’s helps to show this. 

Relational database management systems 
(RDBMS): an RDBMS offers extremely strong 
durability and consistency guarantees, namely 
ACID properties derived from the use of transactions
[18], but these ACID properties can come at
high cost in terms of complexity and overhead. As a
result, Internet services that rely on RDBMS back
ends typically go to great lengths to reduce the
workload presented to the RDBMS, using techniques
such as query caching in front ends [15, 21, 32].
RDBMS’s offer a high degree of data independence,
which is a powerful abstraction that adds
additional complexity and performance overhead. The
many layers of most RDBMS’s (such as SQL pars- 
ing, query optimization, access path selection, etc.) 
permit users to decouple the logical structure of 
their data from its physical layout. This decou- 
pling allows users to dynamically construct and issue 
queries over the data that are limited only by what 
can be expressed in the SQL language, but data in- 
dependence can make parallelization (and therefore 
scaling) hard in the general case. From the per- 
spective of the service properties, an RDBMS al- 
ways chooses consistency over availability: if there 
are media or processor failures, an RDBMS can be- 
come unavailable until the failure is resolved, which 
is unacceptable for Internet services. 


Distributed file systems: file systems have 
less strictly defined consistency models. Some (e.g., 
NFS [31]) have weak consistency guarantees, while 
others (e.g., Frangipani [33] or AFS [12]) guarantee
a coherent filesystem image across all clients, with 
locking typically done at the granularity of files. The 
scalability of distributed file systems similarly varies; 
some use centralized file servers, and thus do not 
scale. Others such as xFS [3] are completely serverless,
and in theory can scale to arbitrarily large
capacities. File systems expose a relatively low level
interface with little data independence; a file sys- 
tem is organized as a hierarchical directory of files, 
and files are variable-length arrays of bytes. These 
elements (directories and files) are directly exposed 
to file system clients; clients are responsible for log- 
ically structuring their application data in terms of 
directories, files, and bytes inside those files. 


Distributed data structures (DDS): a DDS 
has a strictly defined consistency model: all opera- 
tions on its elements are atomic, in that any oper- 
ation completes entirely, or not at all. DDS’s have 
one-copy equivalence, so although data elements in a 
DDS are replicated, clients see a single, logical data 
item. Two-phase commits are used to keep replicas 
coherent, and thus all clients see the same image of 
a DDS through its interface. Transactions across 
multiple elements or operations are not currently 
supported: as we will show later, many of our cur- 
rent protocol design decisions and implementation 
choices exploit the lack of transactional support for 
greater efficiency and simplicity. There are Internet
services that require transactions (e.g., for e-commerce);
we can imagine building a transactional
DDS, but it is beyond the scope of this paper, and we 
believe that the atomic single-element updates and 
coherence provided by our current DDS are strong 
enough to support interesting services. 

A DDS’s interface is more structured and at a 
higher level than that of a file system. The granularity
of an operation is a complete data structure
element rather than an arbitrary byte range. The
set of operations over the data in a DDS is fixed by 
a small set of methods exposed by the DDS API, unlike
an RDBMS in which operations are defined by
the set of expressible declarations in SQL. The query 
parsing and optimization stages of an RDBMS are 
completely obviated in a DDS, but the DDS inter- 
face is less flexible and offers less data independence. 


In summary, by choosing a level of abstraction 
somewhere in between that of an RDBMS and a file 
system, and by choosing a well-defined and simple 
consistency model, we have been able to design and 
implement a DDS with all of the service properties. 
It has been our experience that the DDS interfaces, 
although not as general as SQL, are rich enough to 
successfully build sophisticated services. 


3 Assumptions and Design Principles 


In this section of the paper, we present the de- 
sign principles that guided us while building our dis- 
tributed hash table DDS. We also state a number of 
key assumptions we made regarding our cluster en- 
vironment, failure modes that the DDS can handle, 
and the workloads it will receive. 

Separation of concerns: the clean separation 
of service code from storage management simplifies 
system architecture by decoupling the complexities 
of state management from those of service construc- 
tion. Because persistent service state is kept in the 
DDS, service instances can crash (or be gracefully 
shut down) and restart without a complex recovery 
process. This greatly simplifies service construction, 
as authors need only worry about service-specific 
logic, and not the complexities of data partitioning, 
replication, and recovery. 

Appeal to properties of clusters: in addi- 
tion to the properties listed in section 1.1, we re- 
quire that our cluster is physically secure and well- 
administered. Given all of these properties, a clus- 
ter represents a carefully controlled environment in 
which we have the greatest chance of being able to 
provide all of the service properties. For example, its 
low latency SAN (10-100 µs instead of 10-100 ms for
the wide-area Internet) means that two-phase com- 
mits are not prohibitively expensive. The SAN’s 
high redundancy means that the probability of a 
network partition can be made arbitrarily small, and 
thus we need not consider partitions in our proto- 
cols. An uninterruptible power supply (UPS) and 
good system administration help to ensure that the 
probability of system-wide simultaneous hardware 
failure is extremely low; we can thus rely on data 
being available in more than one failure boundary 
(i.e., the physical memory or disk of more than one 
node) while designing our recovery protocols.¹

Design for high throughput and high con- 
currency: given the workloads presented in section 
1.2, the control structure used to effect concurrency 
is critical. Techniques often used by web servers, 
such as process-per-task or thread-per-task, do not 
scale to our needed degree of concurrency. Instead, 
we use an asynchronous, event-driven style of control
flow in our DDS, similar to that espoused by modern 
high performance servers [5, 20] such as the Harvest 
web cache [8] and Flash web server [28]. A conve- 
nient side-effect of this style is that layering is inex- 
pensive and flexible, as layers can be constructed by 
chaining together event handlers. Such chaining also 
facilitates interposition: a “middleman” event han- 
dler can be easily and dynamically patched between 
two existing handlers. In addition, if a server ex- 
periences a burst of traffic, the burst is absorbed in 
event queues, providing graceful degradation by pre- 
serving the throughput of the server but temporar- 
ily increasing latency. By contrast, thread-per-task 
systems degrade in both throughput and latency if 
bursts are absorbed by additional threads. 
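
The layering and interposition described above can be sketched in a few lines of Python. This is an illustrative, single-threaded model and not the DDS implementation (which is written in Java); the names `Stage`, `interpose`, and `drain` are hypothetical.

```python
from collections import deque


class Stage:
    """One layer: a FIFO event queue plus a handler that may emit downstream."""

    def __init__(self, name, handler, downstream=None):
        self.name = name
        self.handler = handler
        self.downstream = downstream
        self.queue = deque()

    def enqueue(self, event):
        # Bursts simply lengthen the queue; nothing is dropped.
        self.queue.append(event)

    def run_once(self):
        """Process one queued event and forward any result to the next stage."""
        if not self.queue:
            return False
        result = self.handler(self.queue.popleft())
        if result is not None and self.downstream is not None:
            self.downstream.enqueue(result)
        return True


def interpose(upstream, middleman):
    """Patch `middleman` between `upstream` and its current downstream stage."""
    middleman.downstream = upstream.downstream
    upstream.downstream = middleman


def drain(stages):
    """Single-threaded event loop: run until every queue is empty."""
    while any(stage.run_once() for stage in stages):
        pass


# Usage: a two-stage chain with a logging "middleman" patched in afterwards.
delivered = []
seen = []


def log_and_forward(event):
    seen.append(event)
    return event


sink = Stage("sink", delivered.append)  # list.append returns None, so nothing is forwarded
parse = Stage("parse", lambda e: e.strip().upper(), downstream=sink)
logger = Stage("logger", log_and_forward)
interpose(parse, logger)

for raw in [" get 1 ", " put 2 "]:
    parse.enqueue(raw)
drain([parse, logger, sink])
```

Because each stage only knows its downstream neighbor, inserting the logger requires no change to either existing handler.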


3.1 Assumptions 


If one DDS node cannot communicate with an- 
other, we assume it is because this other node has 
stopped executing (due to a planned shutdown or a 
crash); we assume that network partitions do not 
occur inside our cluster, and that DDS software 
components are fail-stop. The need for no network 
partitions is addressed by the high redundancy of 
our network, as previously mentioned. We have at- 
tempted to induce fail-stop behavior in our software 
by having it terminate its own execution if it en- 
counters an unexpected condition, rather than at- 
tempting to gracefully recover from such a condi- 
tion. These strong assumptions have been valid in 
practice; we have never experienced an unplanned 
network partition in our cluster, and our software 
has always behaved in a fail-stop manner. We fur- 
ther assume that software failures in the cluster are 
independent. We replicate all durable data at more 
than one place in the cluster, but we assume that 
at least one replica is active (has not failed) at all 
times. We also assume some degree of synchrony, 
in that processes take a bounded amount of time 
to execute tasks, and that messages take a bounded 
amount of time to be delivered. 

We make several assumptions about the work- 
load presented to our distributed hash tables. A 
table’s key space is the set of 64-bit integers; we 


¹We do have a checkpoint mechanism (discussed later)
that permits us to recover in the case that any of these cluster
properties fail; however, all state changes that happen after
the last checkpoint will be lost should this occur.

assume that the population density over this space
is even (i.e., the probability that a given key exists
in the table is a function of the number of values 
in the table, but not of the particular key). We 
don’t assume that all keys are accessed equiproba- 
bly, but rather that the “working set” of hot keys is 
larger than the number of nodes in our cluster. We 
then assume that a partitioning strategy that maps 
fractions of the keyspace to cluster nodes based on 
the nodes’ relative processing speed will induce a 
balanced workload. Our current DDS design does 
not gracefully handle a small number of extreme 
hotspots (i.e., if a handful of keys receive most of 
the workload). If there are many such hotspots, 
however, then our partitioning strategy will proba- 
bilistically balance them across the cluster. Failure 
of these workload assumptions can result in load im- 
balances across the cluster, leading to a reduction in 
throughput. 

Finally, we assume that tables are large and long 
lived. Hash table creations and destructions are rel- 
atively rare events: the common case is for hash 
tables to serve read, write, and remove operations. 


4 Distributed Hash Tables: Architecture and Implementation


In this section, we present the architecture and 
implementation of a distributed hash table DDS. 
Figure 2 illustrates our hash table’s architecture, 
which consists of the following components: 

Client: a client consists of service-specific soft- 
ware running on a client machine that communi- 
cates across the wide area with one of many service 
instances running in the cluster. The mechanism by 
which the client selects a service instance is beyond 
the scope of this work, but it typically involves DNS 
round robin [7], a service-specific protocol, or level 4 
or level 7 load-balancing switches on the edge of the 
cluster. An example of a client is a web browser, in 
which case the service would be a web server. Note 
that clients are completely unaware of DDS’s: no 
part of the DDS system runs on a client. 

Service: a service is a set of cooperating soft- 
ware processes, each of which we call a service in- 
stance. Service instances communicate with wide- 
area clients and perform some application-level func- 
tion. Services may have soft state (state which may 
be lost and recomputed if necessary), but they rely 
on the hash table to manage all persistent state. 

Hash table API: the hash table API is the 
boundary between a service instance and its “DDS 
library”. The API provides services with put(), 
get(), remove(), create(), and destroy() opera- 
tions on hash tables. Each operation is atomic, and 
all services see the same coherent image of all exist- 


USENIX Association 




[Figure 2 diagram omitted. Legible labels: service; hash table API; SAN (redundant, low latency, high throughput network); storage “brick”; durable hash table; cluster.]


Figure 2: Distributed hash table architecture: 
each box in the diagram represents a software process. In 
the simplest case, each process runs on its own physical 
machine; however, there is nothing preventing processes
from sharing machines. 


ing hash tables through this API. Hash table names 
are strings, hash table keys are 64 bit integers, and 
hash table values are opaque byte arrays; operations 
affect hash table values in their entirety. 
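
As a rough illustration of the interface just described, the following Python sketch models a single-process stand-in for the hash table API. The class name `LocalHashTableService` and the method signatures are assumptions made for illustration; the actual DDS library is a Java class library whose signatures are not given in the text. Keys are treated as unsigned 64-bit integers here, which is also an assumption.

```python
class LocalHashTableService:
    """In-memory stand-in for the hash table API (illustrative only)."""

    KEY_LIMIT = 2 ** 64  # keys are 64-bit integers (treated as unsigned here)

    def __init__(self):
        self._tables = {}  # table name (str) -> {key (int) -> value (bytes)}

    def create(self, name):
        """Create a named table; returns False if the name is already taken."""
        if name in self._tables:
            return False
        self._tables[name] = {}
        return True

    def destroy(self, name):
        """Destroy a named table; returns False if it did not exist."""
        return self._tables.pop(name, None) is not None

    def _check_key(self, key):
        if not (isinstance(key, int) and 0 <= key < self.KEY_LIMIT):
            raise ValueError("key must fit in 64 bits")

    def put(self, name, key, value):
        self._check_key(key)
        # Values are opaque byte arrays and are replaced in their entirety.
        self._tables[name][key] = bytes(value)

    def get(self, name, key):
        self._check_key(key)
        return self._tables[name].get(key)

    def remove(self, name, key):
        self._check_key(key)
        return self._tables[name].pop(key, None) is not None
```

The sketch captures only the shape of the interface (string table names, integer keys, whole-value updates); partitioning, replication, and asynchrony are deliberately left out.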

DDS library: the DDS library is a Java class 
library that presents the hash table API to services. 
The library accepts hash table operations, and co- 
operates with the “bricks” to realize those opera- 
tions. The library contains only soft state, includ- 
ing metadata about the cluster’s current configura- 
tion and the partitioning of data in the distributed 
hash tables across the “bricks”. The DDS library 
acts as the two-phase commit coordinator for state- 
changing operations on the distributed hash tables. 

Brick: bricks are the only system components 
that manage durable data. Each brick manages a 
set of network-accessible single node hash tables. A 
brick consists of a buffer cache, a lock manager, a 
persistent chained hash table implementation, and 
network stubs and skeletons for remote communica- 
tion. Typically, we run one brick per CPU in the 
cluster, and thus a 4-way SMP will house 4 bricks. 
Bricks may run on dedicated nodes, or they may 
share nodes with other components. 


4.1 Partitioning, Replication, and Replica Consistency


A distributed hash table provides incremental 
scalability of throughput and data capacity as more 
nodes are added to the cluster. To achieve this, 
we horizontally partition tables to spread operations 
and data across bricks. Each brick thus stores some 
number of partitions of each table in the system, and 
when new nodes are added to the cluster, this partitioning
is altered so that data is spread onto the new
node. Because of our workload assumptions (section 
3.1), this horizontal partitioning evenly spreads both 
load and data across the cluster. 

Given that the data in the hash table is spread 
across multiple nodes, if any of those nodes fail, then 
a portion of the hash table will become unavailable. 
For this reason, each partition in the hash table is 
replicated on more than one cluster node. The set 
of replicas for a partition form a replica group; all 
replicas in the group are kept strictly coherent with 
each other. Any replica can be used to service a 
get(), but all replicas must be updated during a 
put() or remove(). If a node fails, the data from its 
partitions is available on the surviving members of 
the partitions’ replica groups. Replica group mem- 
bership is thus dynamic; when a node fails, all of 
its replicas are removed from their replica groups. 
When a node joins the cluster, it may be added to 
the replica groups of some partitions (such as in the 
case of recovery, described later). 

To maintain consistency when state changing 
operations (put() and remove()) are issued against
a partition, all replicas of that partition must be 
synchronously updated. We use an optimistic two- 
phase commit protocol to achieve consistency, with 
the DDS library serving as the commit coordinator 
and the replicas serving as the participants. If the 
DDS library crashes after prepare messages are sent, 
but before any commit messages are sent, the repli- 
cas will time out and abort the operation. 

However, if the DDS library crashes after send- 
ing out any commits, then all replicas must com- 
mit. For the sake of availability, we do not rely on 
the DDS library to recover after a crash and issue
pending commits. Instead, replicas store short in- 
memory logs of recent state changing operations and 
their outcomes. If a replica times out while waiting 
for a commit, that replica communicates with all of 
its peers to find out if any have received a commit 
for that operation, and if so, the replica commits as 
well; if not, the replica aborts. Because all peers 
in the replica group that time out while waiting for 
a commit communicate with all other peers, if any 
receives a commit, then all will commit. 
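
The timeout rule above can be sketched as follows. This is a simplified model under stated assumptions: the class `LoggingReplica`, its method names, and the unbounded dictionary used as the "short in-memory log" are all hypothetical, and message delivery is replaced by direct method calls.

```python
class LoggingReplica:
    """Replica that remembers recent operation outcomes and consults peers on timeout."""

    def __init__(self, name):
        self.name = name
        self.outcomes = {}  # op_id -> "prepared" | "committed" | "aborted"
        self.peers = []

    def on_prepare(self, op_id):
        self.outcomes[op_id] = "prepared"

    def on_commit(self, op_id):
        self.outcomes[op_id] = "committed"

    def saw_commit(self, op_id):
        return self.outcomes.get(op_id) == "committed"

    def on_commit_timeout(self, op_id):
        """Called when no commit or abort arrived in time for a prepared operation."""
        if self.outcomes.get(op_id) != "prepared":
            return self.outcomes.get(op_id)  # outcome already known
        # Ask every peer; a single received commit forces this replica to commit.
        if any(peer.saw_commit(op_id) for peer in self.peers):
            self.outcomes[op_id] = "committed"
        else:
            self.outcomes[op_id] = "aborted"
        return self.outcomes[op_id]


def make_group(names):
    """Build a fully connected replica group (hypothetical helper)."""
    group = [LoggingReplica(n) for n in names]
    for replica in group:
        replica.peers = [p for p in group if p is not replica]
    return group
```

In this model, a coordinator that crashes after delivering a commit to only one replica still leads every other replica to commit once it times out, while a crash before any commit leads all of them to abort.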

Any replica may abort during the first phase 
of the two-phase commit (e.g., if the replica cannot 
obtain a write lock on a key). If the DDS library 
receives any abort messages at the end of the first 
phase, it sends aborts to all replicas in the second 
phase. Replicas do not commit side-effects unless 
they receive a commit message in the second phase. 
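
A minimal sketch of the two phases, with the coordinator role played by a plain function, might look like the following. It assumes synchronous calls in place of network messages and omits timeouts and crashes entirely; `Replica`, `two_phase_put`, and the operation-identifier counter are hypothetical names introduced for illustration.

```python
import itertools

_op_ids = itertools.count(1)  # hypothetical source of operation identifiers


class Replica:
    """Participant: stages a value under a per-key write lock until told the outcome."""

    def __init__(self):
        self.data = {}   # committed key -> value
        self.locks = {}  # key -> (op_id, staged value)

    def prepare(self, op_id, key, value):
        if key in self.locks:   # another operation holds the write lock
            return False        # vote to abort
        self.locks[key] = (op_id, value)
        return True

    def commit(self, op_id, key):
        holder = self.locks.get(key)
        if holder is not None and holder[0] == op_id:
            self.data[key] = holder[1]  # side effects become visible only now
            del self.locks[key]

    def abort(self, op_id, key):
        holder = self.locks.get(key)
        if holder is not None and holder[0] == op_id:
            del self.locks[key]         # release only this operation's lock


def two_phase_put(replicas, key, value):
    """Coordinator: returns True if the put committed on every replica."""
    op_id = next(_op_ids)
    # Phase 1: collect a vote from every replica.
    votes = [r.prepare(op_id, key, value) for r in replicas]
    # Phase 2: commit everywhere only if nobody voted to abort.
    if all(votes):
        for r in replicas:
            r.commit(op_id, key)
        return True
    for r in replicas:
        r.abort(op_id, key)
    return False
```

Tagging each lock with its operation identifier lets the coordinator send aborts to all replicas without disturbing a lock held by a different, concurrent operation.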

If a replica crashes during a two-phase commit, 
the DDS library simply removes it from its replica 
group and continues onward. Thus, all replica 
groups shrink over time; we rely on a recovery mech- 
Figure 3: Distributed hash table metadata maps:
this illustration highlights the steps taken to discover the
set of replica groups which serve as the backing store for
a specific hash table key. The key is used to traverse the
DP map trie and retrieve the name of the key’s replica
group. The replica group name is then looked up
in the RG map to find the group’s current membership.


anism (described later) for crashed replicas to rejoin 
the replica group. We made the significant optimiza- 
tion that the image of each replica must only be con- 
sistent through its brick’s cache, rather than having 
a consistent on-disk image. This allows us to have 
a purely conflict-driven cache eviction policy, rather 
than having to force cache elements out to ensure 
on-disk consistency. An implication of this is that if 
all members of a replica group crash, that partition 
is lost. We assume nodes are independent failure 
boundaries (section 3.1); there must be no system- 
atic software failure across nodes, and the cluster’s 
power supply must be uninterruptible. 

Our two-phase commit mechanism gives atomic 
updates to the hash table. It does not, however, give 
transactional updates. If a service wishes to update 
more than one element atomically, our DDS does 
not provide any help. Adding transactional support 
to our DDS infrastructure is a topic of future work, 
but this would require significant additional com- 
plexity such as distributed deadlock detection and 
undo/redo logs for recovery. 

We do have a checkpoint mechanism in our dis- 
tributed hash table that allows us to force the on- 
disk image of all partitions to be consistent; the disk 
images can then be backed up for disaster recov- 
ery. This checkpoint mechanism is extremely heavy- 
weight, however; during the checkpointing of a hash 
table, no state-changing operations are allowed. We 
currently rely on system administrators to decide 
when to initiate checkpoints. 


4.2 Metadata maps 


To find the partition that manages a particular 
hash table key, and to determine the list of replicas 
in partitions’ replica groups, the DDS libraries consult
two metadata maps that are replicated on each
node of the cluster. Each hash table in the cluster 
has its own pair of metadata maps. 

The first map is called the data partitioning 
(DP) map. Given a hash table key, the DP map 
returns the name of the key’s partition. The DP 
map thus controls the horizontal partitioning of data 
across the bricks. As shown in figure 3, the DP map 
is a trie over hash table keys; to find a key’s parti- 
tion, key bits are used to walk down the trie, starting 
from the least significant key bit until a leaf node is 
found. As the cluster grows, the DP trie subdivides 
in a “split” operation. For example, partition 10 
in the DP trie of figure 3 could split into partitions 
010 and 110; when this happens, the keys in the old 
partition are shuffled across the two new partitions. 
The opposite of a split is a “merge”; if the cluster is 
shrunk, two partitions with a common parent in the 
trie can be merged into their parent. For example, 
partitions 000 and 100 in figure 3 could be merged 
into a single partition 00. 
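
The lookup, split, and merge operations can be sketched as below. The sketch stores only the trie's leaves, named by the key-bit suffix they cover, which is equivalent to walking a trie from the least significant bit. The class name `DPMap` and the leaf set used in the usage example are assumptions; the leaf set was chosen only to be consistent with the partitions named in the text, and the shuffling of keys between bricks on a split is not modeled.

```python
class DPMap:
    """Data partitioning map: leaves of a trie over key bits, LSB first."""

    def __init__(self, partitions):
        # A partition name is the suffix of key bits it covers, e.g. "10".
        self.leaves = set(partitions)

    def lookup(self, key):
        """Walk key bits from the least significant bit until a leaf is found."""
        suffix = ""
        for depth in range(64):
            bit = (key >> depth) & 1
            suffix = str(bit) + suffix  # each more significant bit goes on the left
            if suffix in self.leaves:
                return suffix
        raise KeyError("no partition covers this key")

    def split(self, name):
        """Subdivide a leaf into two children, one per value of the next key bit."""
        self.leaves.remove(name)
        self.leaves.update({"0" + name, "1" + name})

    def merge(self, parent):
        """Merge two sibling leaves back into their common parent."""
        children = {"0" + parent, "1" + parent}
        if not children <= self.leaves:
            raise ValueError("both children must currently be leaves")
        self.leaves -= children
        self.leaves.add(parent)


# Usage with an assumed leaf set that includes partitions 10, 000, and 100.
dp = DPMap(["000", "100", "10", "01", "11"])
before_split = dp.lookup(0b110)   # low bits "10"
dp.split("10")
after_split = dp.lookup(0b110)    # now resolved one bit deeper
dp.merge("00")
after_merge = dp.lookup(0b100)
```

Because names grow on the left as the walk descends, a split simply prefixes the old name with one more bit, matching the example of partition 10 becoming 010 and 110.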

The second map is called the replica group (RG) 
membership map. Given a partition name, the RG 
map returns a list of bricks that are currently serv- 
ing as replicas in the partition’s replica group. The 
RG maps are dynamic: if a brick fails, it is removed 
from all RG maps that contain it. A brick joins 
a replica group after finishing recovery. An invari- 
ant that must be preserved is that the replica group 
membership maps for all partitions in the hash table 
must have at least one member. 

The maps are replicated on each cluster node, 
in both the DDS libraries and the bricks. The maps 
must be kept consistent, otherwise operations may 
be applied to the wrong bricks. Instead of enforcing 
consistency synchronously, we allow the libraries’ 
maps to drift out of date, but lazily update them 
when they are used to perform operations. The 
DDS library piggybacks hashes of the maps² on operations
sent to bricks; if a brick detects that either
map used is out of date, the brick fails the operation 
and returns a “repair” to the library. Thus, all maps 
become eventually consistent as they are used. Be- 
cause of this mechanism, libraries can be restarted 
with out of date maps, and as the library gets used 
its maps become consistent. 
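
The lazy repair mechanism can be sketched as follows. The text does not say which hash function is used, so truncating a SHA-256 digest to 32 bits is purely an assumption, as are the names `map_hash`, `Brick`, and `Library` and the choice to carry the fresh map inside the "repair" reply. Only the RG map is modeled.

```python
import hashlib
import json


def map_hash(mapping):
    """32-bit digest of a metadata map (hash function chosen for illustration)."""
    canonical = json.dumps(mapping, sort_keys=True).encode()
    return hashlib.sha256(canonical).digest()[:4]


class Brick:
    def __init__(self, rg_map):
        self.rg_map = dict(rg_map)
        self.store = {}

    def handle_put(self, key, value, sender_hash):
        # Fail the operation if the sender's map is out of date.
        if sender_hash != map_hash(self.rg_map):
            return ("repair", dict(self.rg_map))
        self.store[key] = value
        return ("ok", None)


class Library:
    def __init__(self, rg_map):
        self.rg_map = dict(rg_map)
        self.repairs = 0

    def put(self, brick, key, value, max_attempts=2):
        for _ in range(max_attempts):
            status, fresh = brick.handle_put(key, value, map_hash(self.rg_map))
            if status == "ok":
                return True
            # Apply the repair, then retry with the updated map.
            self.rg_map = fresh
            self.repairs += 1
        return False
```

A library restarted with a stale map pays for one failed operation and is consistent from then on, which is the "eventually consistent as they are used" behavior described above.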

To put() a key and value into a hash table, 
the DDS library servicing the operation consults its 
DP map to determine the correct partition for the 
key. It then looks up that partition name in its RG 
map to find the current set of bricks serving as repli- 
cas, and finally performs a two-phase commit across 
these replicas. To do a get() of a key, a similar 
process is used, except that the DDS library can
select any of the replicas listed in the RG map to
service the read. We use the locality-aware request
distribution (LARD) technique [14] to select a read
replica; LARD further partitions keys across replicas,
in effect aggregating their physical caches.

²It is important to use a hash large enough to make the
probability of collision negligible; we currently use 32 bits.


4.3 Recovery


If a brick fails, all replicas on it become un- 
available. Rather than making these partitions un- 
available, we remove the failed brick from all replica 
groups and allow operations to continue on the sur- 
viving replicas. When the failed brick recovers (or 
an alternative brick is selected to replace it), it must 
“catch up” to all of the operations it missed. In 
many RDBMS’s and file systems, recovery is a com- 
plex process that involves replaying logs, but in our 
system we use properties of clusters and our DDS 
design for vast simplifications. 

Firstly, we allow our hash table to “say no”— 
bricks may return a failure for an operation, such 
as when a two-phase commit cannot obtain locks on 
all bricks (e.g., if two puts() to the same key are 
simultaneously issued), or when replica group mem- 
berships change during an operation. The freedom 
to say no greatly simplifies system logic, since we 
don’t worry about correctly handling operations in 
these rare situations. Instead, we rely on the DDS 
library (or, ultimately, the service and perhaps even 
the WAN client) to retry the operation. Secondly, 
we don’t allow any operation to finish unless all par- 
ticipating components agree on the metadata maps. 
If any component has an out-of-date map, opera- 
tions fail until the maps are reconciled. 

We make our partitions relatively small 
(~100MB), which means that we can transfer an en- 
tire partition over a fast system-area network (typ- 
ically 100 Mb/s to 1 Gb/s) within 1 to 10 seconds. 
Thus, during recovery, we can incrementally copy 
entire partitions to the recovering node, obviating 
the need for the undo and redo logs that are typi- 
cally maintained by databases for recovery. When 
a node initiates recovery, it grabs a write lease on 
one replica group member from the partition that 
it is joining; this write lease means that all state- 
changing operations on that partition will start to 
fail. Next, the recovering node copies the entire 
replica over the network. Then, it sends updates 
to the RG map to all other replicas in the group, 
which means that DDS libraries will start to lazily 
receive this update. Finally, it releases the write 
lease, which means that the previously failed operations
will succeed on retry. The recovery of the
partition is now complete, and the recovering node 
can begin recovery of other partitions as necessary. 
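
The recovery steps above, together with the transfer-time estimate, can be sketched as follows. This is a simplified model: the write lease is a plain flag with no expiry, the network copy is a dictionary copy, and the names `ReplicaBrick`, `group_put`, `recover_partition`, and the `during_copy` hook are hypothetical. The transfer-time helper ignores protocol overhead.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Idealized time to copy a partition over the network."""
    return size_bytes * 8 / link_bits_per_second


class ReplicaBrick:
    def __init__(self, name, data=None):
        self.name = name
        self.data = dict(data or {})
        self.write_leased = False


def group_put(group, key, value):
    """A state-changing operation needs every replica; any refusal fails it."""
    if any(brick.write_leased for brick in group):
        return False  # the table "says no"; the caller is expected to retry
    for brick in group:
        brick.data[key] = value
    return True


def recover_partition(newcomer, group, rg_map, partition, during_copy=None):
    source = group[0]
    source.write_leased = True                 # 1. grab a write lease on one member
    try:
        newcomer.data = dict(source.data)      # 2. copy the entire replica
        if during_copy is not None:
            during_copy()                      #    (hook to observe the leased state)
        group.append(newcomer)                 # 3. join the group, update the RG map
        rg_map[partition] = [b.name for b in group]
    finally:
        source.write_leased = False            # 4. release the lease


# A 100 MB partition over 100 Mb/s and 1 Gb/s links.
slow = transfer_seconds(100e6, 100e6)
fast = transfer_seconds(100e6, 1e9)
```

Under these idealized assumptions the two link speeds give 8 seconds and 0.8 seconds respectively, in line with the 1 to 10 second range quoted above.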

There is an interesting choice of the rate at 
which partitions are transferred over the network
during recovery. If this rate is fast, then the involved
bricks will suffer a loss in read throughput during the 
recovery. If this rate is slow, then the bricks won’t 
lose throughput, but the partition’s mean time to re- 
covery will increase. We chose to recover as quickly 
as possible, since in a large cluster only a small frac- 
tion of the total throughput of the cluster will be 
affected by the recovery. 

A similar technique is used for DP map split 
and merge operations, except that all replicas must 
be modified and both the RG and DP maps are up- 
dated at the end of the operation. 


4.3.1 Convergence of Recovery 


A challenge for fault-tolerant systems is to re- 
main consistent in the face of repeated failures; our 
recovery scheme described above has this property. 
In steady state operation, all replicas in a group 
are kept perfectly consistent. During recovery, state 
changing operations fail (but only on the recovering 
partition), implying that surviving replicas remain 
consistent and recovering nodes have a stable image 
from which to recover. We also ensure that a recov- 
ering node only joins the replica group after it has 
successfully copied over the entire partition’s data 
but before it releases its write lease. A remaining
window of vulnerability in the system is if recov- 
ery takes longer than the write lease; if this seems 
imminent, the recovering node could aggressively re- 
new its write lease, but we have not currently im- 
plemented this behavior. 

If a recovering node crashes during recovery, its 
write lease will expire and the system will continue 
as normal. If the replica on which the lease was 
grabbed crashes, the recovering node must reiniti- 
ate recovery with another surviving member of the 
replica group. If all members of a replica group 
crash, data will be lost, as mentioned in Section 3.1. 


4.4 Asynchrony 


All components of the distributed hash table 
are built using an asynchronous, event-driven pro- 
gramming style. Each hash table layer is designed 
so that only a single thread ever executes in it at 
a time. This greatly simplified implementation by 
eliminating the need for data locks and the possibility of race
conditions due to threads. Hash table layers are separated
by FIFO queues, into which I/O completion events 
and I/O requests are placed. The FIFO discipline 
of these queues ensures fairness across requests, and 
the queues act as natural buffers that absorb bursts 
that exceed the system’s throughput capacity. 

All interfaces in the system (including the DDS 
library APIs) are split-phase and asynchronous. 
This means that a hash table get() doesn’t block, 
but rather immediately returns with an identifier 
Figure 4: Throughput scalability: this benchmark
shows the linear scaling of throughput as a function of
the number of bricks serving in a distributed hash table;
note that both axes have logarithmic scales. As we added
more bricks to the DDS, we increased the number of
clients using the DDS until throughput saturated.


that can be matched up with a completion event 
that is delivered to a caller-specified upcall handler. 
This upcall handler can be application code, or it 
can be a queue that is polled or blocked upon. 
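
A split-phase interface of this kind can be sketched as below. The class `AsyncTable`, the `poll` method that stands in for the event loop, and the upcall signature `(request_id, result)` are assumptions for illustration; they are not the DDS library's actual API.

```python
import itertools
from collections import deque


class AsyncTable:
    """Split-phase table: calls return an identifier; results arrive via upcalls."""

    def __init__(self):
        self._data = {}
        self._ids = itertools.count(1)
        self._pending = deque()  # FIFO queue of outstanding requests

    def put(self, key, value, upcall):
        request_id = next(self._ids)
        self._pending.append((request_id, "put", key, value, upcall))
        return request_id  # returns immediately, before the work is done

    def get(self, key, upcall):
        request_id = next(self._ids)
        self._pending.append((request_id, "get", key, None, upcall))
        return request_id

    def poll(self):
        """Complete queued requests in order and deliver their completion events."""
        while self._pending:
            request_id, op, key, value, upcall = self._pending.popleft()
            if op == "put":
                self._data[key] = value
                result = True
            else:
                result = self._data.get(key)
            upcall(request_id, result)


# Usage: the upcall handler here is a queue that the caller inspects later.
table = AsyncTable()
completions = deque()


def enqueue_completion(request_id, result):
    completions.append((request_id, result))


put_id = table.put(7, b"value", enqueue_completion)
get_id = table.get(7, enqueue_completion)
pending_before_poll = len(completions)
table.poll()
```

The caller matches each completion to its request by identifier, so many operations can be in flight without a thread blocked on each one.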


5 Performance 


In this section, we present performance bench- 
marks of the distributed hash table implementation 
that were gathered on a cluster of 28 2-way SMPs 
and 38 4-way SMPs (a total of 208 500 MHz Pentium 
CPUs). Each 2-way SMP has 500 MB of RAM, and 
each 4-way SMP has 1 GB. All are connected with 
either 100 Mb/s switched Ethernet (2-way SMPs) 
or 1 Gb/s switched Ethernet (4-way SMPs). The 
benchmarks are run using Sun’s JDK 1.1.7v3, using 
the OpenJIT 1.1.7 JIT compiler and “green” (user- 
level) threads on top of Linux v2.2.5. 

When running our benchmarks, we evenly 
spread hash table bricks amongst 4-way and 2-way 
SMPs, running at most one brick node per CPU in 
the cluster. Thus, 4-way SMPs would have at most 4 
brick processes running on them, while 2-way SMPs 
would have at most 2. We also made use of these 
cluster nodes as load generators; because of this, we 
were only able to gather performance numbers to 
a maximum of a 128 brick distributed hash table, 
as we needed the remaining 80 CPUs to generate 
enough load to saturate such a large table. 


5.1 In-Core Benchmarks 


Our first set of benchmarks tested the in-core 
performance of the distributed hash table. By limiting
the working set of keys that we requested to a
size that fits in the aggregate physical memory of the 
bricks, this set of benchmarks investigates the over- 
head and throughput of the distributed hash table 
code independently of disk performance. 
Figure 5: Graceful degradation of reads: this
graph demonstrates that the read throughput from a 
distributed hash table remains constant even if the of- 
fered load exceeds the capacity of the hash table. 


5.1.1 Throughput Scalability 


This benchmark demonstrates that hash ta- 
ble throughput scales linearly with the number of 
bricks. The benchmark consists of several services 
that each maintain a pipeline of 100 operations (ei- 
ther gets() or puts()) to a single distributed hash 
table. We varied the number of bricks in the hash 
table; for each configuration, we slowly increased 
the number of services and measured the comple- 
tion throughput flowing from the bricks. All config- 
urations had 2 replicas per replica group, and each 
benchmark iteration consisted of reads or writes of 
150-byte values. The benchmark was closed-loop: a 
new operation was immediately issued with a ran- 
dom key for each completed operation. 

Figure 4 shows the maximum throughput sus- 
tained by the distributed hash table as a function of 
the number of bricks. Throughput scales linearly up 
to 128 bricks; we didn’t have enough processors to 
scale the benchmark further. The read throughput 
achieved with 128 bricks is 61,432 reads per second 
(5.3 billion per day), and the write throughput with 
128 bricks is 13,582 writes per second (1.2 billion 
per day); this performance is adequate to serve the 
hit rates of most popular web sites on the Internet. 
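
The per-day figures quoted above follow directly from the per-second rates, as the short check below shows. The per-brick averages are derived values that do not appear in the text; they assume load is spread evenly over the 128 bricks.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BRICKS = 128

READS_PER_SECOND = 61_432
WRITES_PER_SECOND = 13_582


def per_day(rate_per_second):
    """Scale a sustained per-second rate to a full day."""
    return rate_per_second * SECONDS_PER_DAY


reads_per_day = per_day(READS_PER_SECOND)
writes_per_day = per_day(WRITES_PER_SECOND)

# Derived averages, assuming an even spread across bricks.
reads_per_brick = READS_PER_SECOND / BRICKS
writes_per_brick = WRITES_PER_SECOND / BRICKS

print(f"reads/day:  {reads_per_day / 1e9:.1f} billion")
print(f"writes/day: {writes_per_day / 1e9:.1f} billion")
```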


5.1.2 Graceful Degradation for Reads 


Bursts of traffic are a common phenomenon for 
all Internet services. If a traffic burst exceeds the 
service’s capacity, the service should have the prop- 
erty of “graceful degradation”: the throughput of 
the service should remain constant, with the excess 
traffic either being rejected or absorbed in buffers 
and served with higher latency. Figure 5 shows the 
throughput of a distributed hash table as a func- 
tion of the number of simultaneous read requests 
issued to it; each service instance has a closed-loop 
pipeline of 100 operations. Each line on the graph 
represents a different number of bricks serving the 
Figure 6: Write imbalance leading to ungraceful
degradation: the bottom curve shows the throughput 
of a two-brick partition under overload, and the top two 
curves show the CPU utilization of those bricks. One 
brick is saturated, the other becomes only 30% busy. 


hash table. Each configuration is seen to eventually 
reach a maximum throughput as its bricks saturate. 
This maximum throughput is successfully sustained 
even as additional traffic is offered. The overload 
traffic is absorbed in the FIFO event queues of the 
bricks; all tasks are processed, but they experience 
higher latency as the queues drain from the burst. 


5.1.3 Ungraceful Degradation for Writes 


An unfortunate performance anomaly emerged 
when benchmarking put() throughput. As the of- 
fered load approached the maximum capacity of the 
hash table bricks, the total write throughput sud- 
denly began to drop. On closer examination, we 
discovered that most of the bricks in the hash ta- 
ble were unloaded, but one brick in the hash table 
was completely saturated and had become the bot- 
tleneck in the closed-loop benchmark. 

Figure 6 illustrates this imbalance. To generate 
it, we issued puts() to a hash table with a single 
partition and two replicas in its replica group. Each 
put() operation caused a two-phase commit across
both replicas, and thus each replica saw the same set 
of network messages and performed the same com- 
putation (but perhaps in slightly different orders). 
We expected both replicas to perform identically,
but instead one replica became more and more idle, 
and the throughput of the hash table dropped to 
match the CPU utilization of this idle replica. 

Investigation showed that the busy replica was 
spending a significant amount of time in garbage 
collection. As more live objects populated that 
replica’s heap, more time needed to be spent garbage 
collecting to reclaim a fixed amount of heap space, as 
more objects would be examined before a free object 
was discovered. Random fluctuations in arrival rates 
and garbage collection caused one replica to spend 
more time garbage collecting than the other. This 
replica became the system bottleneck, and more 
operations piled up in its queues, further amplify- 
ing this imbalance. Write traffic particularly ex- 
Figure 7: Throughput vs. read size: the X axis shows
the size of values read from the hash table, and the Y 
axis shows the maximum throughput sustained by an 8 
brick hash table serving these values. 


acerbated the situation, as objects created by the 
“prepare” phase must wait for at least one network 
round-trip time before a commit or abort command 
in the second phase is received. The number of live 
objects in each brick’s heap is thus proportional to
the bandwidth-delay product of hash table put()
operations. For read traffic, there is only one phase, 
and thus objects can be garbage collected immedi- 
ately after read requests are satisfied. 
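The bandwidth-delay relationship for write traffic is an instance of Little's law: the number of objects in flight equals the arrival rate multiplied by the time each object stays live. A minimal sketch follows, assuming hypothetical rates and round-trip times chosen only to show the scaling (they are not measurements reported here); the function name is likewise illustrative.

```python
def live_prepared_objects(puts_per_second: float, round_trip_seconds: float) -> float:
    """Little's law: objects in flight = arrival rate x time each object stays live.

    An object created by the "prepare" phase stays live for at least one
    network round trip, so this is a lower bound on the live-object count.
    """
    return puts_per_second * round_trip_seconds


# Hypothetical figures, chosen only to show the scaling.
baseline = live_prepared_objects(puts_per_second=1000, round_trip_seconds=0.002)
slower = live_prepared_objects(puts_per_second=1000, round_trip_seconds=0.004)
print(baseline, slower)
```

Doubling the round-trip time doubles the lower bound on live objects, which is why a replica that falls behind keeps more objects live and spends still more time collecting.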

We experimented with many JDKs, but consis- 
tently saw this effect. Some JDKs (such as JDK 
1.2.2 on Linux 2.2.5) developed this imbalance for 
read traffic as well as write traffic. This sort of per- 
formance imbalance is fundamental to any system 
that doesn’t perform admission control; if the task 
arrival rate temporarily exceeds the system’s ability
to handle tasks, then tasks will begin to pile
up in the system. Because systems have finite re- 
sources, this inevitably causes performance degra- 
dation (thrashing). In our system, this degradation 
first materialized due to garbage collection. In other 
systems, this might happen due to virtual memory 
thrashing, to pick an example. We are currently ex- 
ploring using admission control (at either the bricks 
or the hash table libraries) or early discard from 
bricks’ queues to keep the bricks within their oper- 
ational range, ameliorating this imbalance. 
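One simple form of admission control is a bounded queue that refuses new tasks once it is full, so that overload is shed at the edge rather than accumulating inside the brick. The sketch below is an illustration under assumptions: the class name, the fixed capacity, and the reject-on-full policy are hypothetical and are not taken from the brick implementation.

```python
from collections import deque


class BoundedTaskQueue:
    """Task queue that rejects new work once a fixed capacity is reached."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.tasks = deque()
        self.rejected = 0

    def offer(self, task) -> bool:
        """Admit the task if there is room; otherwise discard it."""
        if len(self.tasks) >= self.capacity:
            self.rejected += 1
            return False
        self.tasks.append(task)
        return True

    def take(self):
        """Remove and return the oldest admitted task, or None if empty."""
        return self.tasks.popleft() if self.tasks else None


queue = BoundedTaskQueue(capacity=2)
admitted = [queue.offer(f"put-{i}") for i in range(4)]
print(admitted, queue.rejected)
```

A rejected task would have to be retried or failed by the caller; choosing the capacity so that the brick stays within its operational range is the open tuning question.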


5.1.4 Throughput Bottlenecks 


In Figure 7, we varied the size of elements that
we read out of an 8 brick hash table. Throughput 
was flat from 50 bytes through 1000 bytes, but then 
began to degrade. From this we deduced that per- 
operation overhead (such as object creation, garbage 
collection, and system call overhead) saturated the 
bricks’ CPUs for elements smaller than 1000 bytes, 
and per-byte overhead (byte array copies, either in 
the TCP stack or in the JVM) saturated the bricks’ 
CPUs for elements greater than 1000 bytes. At 8000 
bytes, the throughput in and out of each 2-way SMP 
(running 2 bricks) was 60 Mb/s. For larger sized 
hash table values, the 100 Mb/s switched network
became the throughput bottleneck. 


5.2 Out-of-core Benchmarks 


Our next set of benchmarks tested performance 
for workloads that do not fit in the aggregate phys- 
ical memory of the bricks. These benchmarks stress 
the single-node hash table’s disk interaction, as well 
as the performance of the distributed hash table. 


5.2.1 A Terabyte DDS 


To test how well the distributed hash table 
scales in terms of data capacity, we populated a hash 
table with 1.28 terabytes of 8KB data elements. To 
do this, we created a table with 512 partitions in its 
DP map, but with only 1 replica per replica group 
(i.e., the table would not withstand node failures). 
We spread the 512 partitions across 128 brick nodes, 
and ran 2 bricks per node in the cluster. Each brick 
stored its data on a dedicated 12GB disk (all cluster 
nodes have 2 of these disks). The bricks each used 
10GB worth of disk capacity, resulting in 1.28TB of 
data stored in the table. 

To populate the 1.28TB hash table, we designed 
bulk loaders that generated writes to keys in an or- 
der that was carefully chosen to result in sequential 
disk writes. These bulk loaders understood the par- 
titioning in the DP map and implementation details 
about the single-node tables’ hash functions (which 
map keys to disk blocks). Using these loaders, it 
took 130 minutes to fill the table with 1.28 terabytes 
of data, achieving a total write throughput of 22,015 
operations/s, or 1.4 MB/s per disk. 

Comparatively, the in-core throughput bench- 
mark presented in Section 5.1.1 obtained 13,582 op- 
erations/s for a 128 brick table, but that bench- 
mark was configured with 2 replicas per replica 
group. Eliminating this replication would double
the throughput of the in-core benchmark, resulting
in 27,164 operations/s. The bulk loading of
the 1.28TB hash table was therefore only marginally 
slower in terms of the throughput sustained by each 
replica than the in-core benchmarks, which means 
that disk throughput was not the bottleneck. 
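The bulk-load figures can be checked with a few lines of arithmetic. The sketch below assumes 128 disks in total (one per brick, as implied by 10GB per brick summing to 1.28TB), 8KB = 8192 bytes, and 1 MB = 10^6 bytes; these unit conventions are assumptions, as is the function name.

```python
ELEMENT_BYTES = 8 * 1024   # 8KB elements
DISKS = 128                # assumption: one dedicated disk per brick, 128 bricks


def mb_per_second_per_disk(operations_per_second: float) -> float:
    """Convert aggregate operations/s into MB/s per disk (1 MB = 10**6 bytes)."""
    return operations_per_second * ELEMENT_BYTES / DISKS / 1e6


bulk_load = mb_per_second_per_disk(22_015)
unreplicated_in_core = 2 * 13_582   # removing the second replica doubles throughput
ratio = 22_015 / unreplicated_in_core

print(f"{bulk_load:.2f} MB/s per disk")
print(unreplicated_in_core, f"{ratio:.2f}")
```

Under these assumptions the bulk load works out to about 1.4 MB/s per disk and to roughly four fifths of the projected unreplicated in-core throughput.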


5.2.2 Random Write and Read Throughput


However, we believe it is unrealistic and unde- 
sirable for hash table clients to have knowledge of 
the DP map and single-node tables’ hash functions. 
We ran a second set of throughput benchmarks on 
another 1.28TB hash table, but populated it with 
random keys. With this workload, the table took 
319 minutes to populate, resulting in a total write 
throughput of 8,985 operations/s, or 0.57 MB/s per 
disk. We similarly sustained a read throughput of
14,459 operations/s, or 0.93 MB/s per disk.³

This throughput is substantially lower than the
throughput obtained during the in-core benchmarks
because the random workload results in
random read and write traffic to each disk. In fact,
for this random workload, every read() issued to 
the distributed hash table results in a request for a 
random disk block from a disk. All disk traffic is 
seek dominated, and disk seeks become the overall 
bottleneck of the system. 

We expect that there will be significant locality 
in DDS requests generated by Internet services, and 
given workloads with high locality, the DDS should 
perform nearly as well as the in-core benchmark re- 
sults. However, it might be possible to significantly 
improve the write performance of traffic with lit- 
tle locality by using disk layout techniques similar 
to those of log-structured file systems [29]; we have 
not explored this possibility as of yet. 


5.3 Availability and Recovery 


To demonstrate availability in the face of node 
failures and the ability for the bricks to recover af- 
ter a failure, we repeated the read benchmark with 
a hash table of 150 byte elements. The table was 
configured with a single 100MB partition and three 
replicas in that partition’s replica group. Figure 8 
shows the throughput of the hash table over time 
as we induced a fault in one of the replica bricks 
and later initiated its recovery. During recovery, the
rate at which the recovered partition was copied was
12 MB/s, which is the maximum sequential write bandwidth
we could obtain from the bricks’ disks.

At point (1), all three bricks were operational 
and the throughput sustained by the hash table was 
450 operations per second. At point (2), one of the 
three bricks was killed. Performance immediately 
dropped to 300 operations per second, two-thirds 
of the original capacity. Fault detection was imme- 
diate: client libraries experienced broken transport 
connections that could not be reestablished. The 
performance overhead of the replica group map up- 
dates could not be observed. At point (3), recov- 
ery was initiated, and recovery completed at point 
(4). Between points (3) and (4), there was no no- 
ticeable performance overhead of recovery; this is 
because there was ample excess bandwidth on the 
network, and the CPU overhead of transferring the 
partition during recovery was negligible. It should
be noted that between points (3) and (4), the recovering
partition is not available for writes, because of
the write lease grabbed during recovery. This partition
is available for reads, however.

³ Write throughput is less than read throughput because a
hash bucket must be read before it can be written, in case there
is already data stored in that bucket that must be preserved.
There is therefore an additional read for every write, nearly
halving the effective throughput for DDS writes.

Figure 8: Availability and Recovery: this benchmark
shows the read throughput of a 3-brick hash table
as a deliberate single-node fault is induced, and afterwards
as recovery is performed.

After recovery completed, performance briefly 
dropped at point (5). This degradation is due to the 
buffer cache warming on the recovered node. Once 
the cache became warm, performance returned to
the original 450 operations/s at point (6). An interesting
anomaly at point (6) is the presence of noticeable
oscillations in throughput; these were traced to
garbage collection triggered by the “extra” activity 
of recovery. When we repeated our measurements, 
we would occasionally see this oscillation at other 
times besides immediately post-recovery. This sort 
of performance unpredictability due to garbage col- 
lection seems to be a pervasive problem; a better 
garbage collector or admission control might ame- 
liorate this, but we haven’t yet explored this. 


6 Example Services 


We have implemented a number of interesting 
services using our distributed hash table. The ser- 
vices’ implementation was greatly simplified by us- 
ing the DDS, and they trivially scaled by adding 
more service instances. An aspect of scalability not 
covered by using the hash table was the routing and 
load balancing of WAN client requests across service 
instances, but this is beyond the scope of this work. 

Sanctio: Sanctio is an instant messaging gate- 
way that provides protocol translation between pop- 
ular instant messaging protocols (such as Mirabilis’ 
ICQ and AOL’s AIM), conventional email, and voice 
messaging over cellular telephones. Sanctio is a mid- 
dleman between these protocols, routing and trans- 
lating messages between the networks. In addition 
to protocol translation, Sanctio also can transform 
the message content. We have built a “web scraper” 
that allows us to compose AltaVista’s BabelFish 
natural language translation service with Sanctio. 
We can thus perform language translation (e.g., English
to French) as well as protocol translation; a
Spanish speaking ICQ user can send a message to
an English speaking AIM user, with Sanctio providing
both language and protocol translation.

A user may be reached on a number of different 
addresses, one for each of the networks that Sanctio 
can communicate with. The Sanctio service must 
therefore keep a large table of bindings between 
users and their current transport addresses on these 
networks; we used the distributed hash table for this 
purpose. The expected workload on the DDS in- 
cludes significant write traffic generated when users 
change networks or log in and out of a network. The 
data in the table must be kept consistent, otherwise 
messages will be routed to the wrong address. 

Sanctio took 1 person-month to develop, most of
which was spent authoring the protocol translation
code. The code that interacts with the distributed 
hash table took less than a day to write. 

Web server: we have implemented a scalable 
web server using the distributed hash table. The 
server speaks HTTP to web clients, hashes requested 
URLs into 64 bit keys, and requests those keys from 
the hash table. The server takes advantage of the 
event-driven, queue-centric programming style to 
introduce CGI-like behavior by interposing on the
URL resolution path. This web server was written
in 900 lines of Java, 750 of which deal with HTTP
parsing and URL resolution, and only 50 of which
deal with interacting with the hash table DDS.
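The URL resolution path can be sketched as follows. This is an illustration under assumptions: the text does not say which hash function the server uses, so truncating a SHA-1 digest to 8 bytes is a stand-in, and a plain in-memory dictionary stands in for the distributed hash table. The names `url_to_key` and `resolve` are hypothetical.

```python
import hashlib
from typing import Dict, Optional


def url_to_key(url: str) -> int:
    """Map a URL to a 64-bit integer key (first 8 bytes of its SHA-1 digest)."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


# A plain dict stands in for the distributed hash table.
table: Dict[int, bytes] = {}
table[url_to_key("/index.html")] = b"<html>hello</html>"


def resolve(url: str) -> Optional[bytes]:
    """Return the stored page for a URL, or None if the key is absent."""
    return table.get(url_to_key(url))


print(resolve("/index.html"), resolve("/missing.html"))
```

Because the service logic reduces to one hash and one lookup, almost all of the remaining code in such a server is HTTP handling, which matches the line counts quoted above.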

Others: We have built many other services 
as part of the Ninja project⁴. The “Parallelisms”
service recommends related sites to user-specified 
URLs by looking up ontological entries in an inver- 
sion of the Yahoo web directory. We built a collab- 
orative filtering engine for a digital music jukebox 
service [16]; this engine stores users’ music prefer- 
ences in a distributed hash table. We have also im- 
plemented a private key store and a composable user 
preference service, both of which use the distributed 
hash table for persistent state management. 


7 Discussion 


Our experience with the distributed hash table 
implementation has taught us many lessons about 
using it as a storage platform for scalable services. 
The hash table was a resounding success in simpli- 
fying the construction of interesting services, and 
these services inherited the scalability, availability, 
and data consistency of the hash table. Exploiting 
properties of clusters also proved to be remarkably 
useful. In our experience, most of the assumptions 
that we made regarding properties of clusters and
component failures (specifically the fail-stop behavior
of our software and the probabilistic lack of network
partitions in the cluster) were valid in practice.

⁴ http://ninja.cs.berkeley.edu/

One of our assumptions was initially problem- 
atic: we observed a case where there was a system- 
atic failure of all replica group members inside a 
single replica group. This failure was caused by a 
software bug that enabled service instances to deter- 
ministically crash remote bricks by inducing a null 
pointer exception in the JVM. After fixing the as- 
sociated bug in the brick, this situation never again 
arose. However, it serves as a reminder that sys- 
tematic software bugs can in practice bring down 
the entire cluster at once. Careful software engi- 
neering and a good quality assurance cycle can help 
to ameliorate this failure mode, but we believe that 
this issue is fundamental to all systems that promise 
both availability and consistency. 

As we scaled our distributed hash table, we 
noticed scaling bottlenecks that weren’t associated 
with our own software. At 128 bricks, we ap- 
proached the point at which the 100 Mb/s Ether- 
net switches would saturate; upgrading to 1 Gb/s 
switches throughout the cluster would delay this sat- 
uration. We also noticed that the combination of our 
JVM’s user-level threads and the Linux kernel began
to induce poor scaling behavior as each node
in the cluster opened up a reliable TCP connection 
to all other nodes in the cluster. The brick processes 
began to saturate due to a flood of signals from the 
kernel to the user-level thread scheduler associated 
with TCP connections with data waiting to be read. 


7.1 Java as a Service Platform 


We found that Java was an adequate platform 
from which to build a scalable, high performance 
subsystem. However, we ran into a number of seri- 
ous issues with the Java language and runtime. The 
garbage collector of all JVMs that we experimented 
with inevitably became the performance bottleneck 
of the bricks and also a source of throughput and 
latency variation. Whenever the garbage collector 
became active, it had a serious impact on all other 
system activity, and unfortunately, current JVMs do 
not provide adequate interfaces to allow systems to 
control garbage collection behavior. 

The type safety and array bounds checking fea- 
tures of Java vastly accelerated our software engi- 
neering process, and helped us to write stable, clean 
code. However, these features got in the way of code 
efficiency, especially when dealing with multiple lay- 
ers of a system each of which wraps some array of 
data with layer-specific metadata. We often found 
ourselves performing copies of regions of byte arrays 
in order to maintain clean interfaces to data regions, 
whereas in a C implementation it would be more 
natural to exploit pointers into malloc’ed memory
regions to the same effect without needing copies.

Java lacks asynchronous I/O primitives, which
necessitated the use of a thread pool at the lowest
layer of the system. This is much more efficient
than a thread-per-task system, as the number of 
threads in our system is equal to the number of 
outstanding I/O requests rather than the number 
of tasks. Nonetheless, it introduced performance 
overhead and scaling problems, since the number 
of TCP connections per brick increases with the 
cluster size. We are working on introducing high- 
throughput asynchronous I/O completion mecha- 
nisms into the JVM using the JNI native interface. 
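The general technique of emulating asynchronous I/O with a thread pool can be sketched as below: each blocking call runs on a pool thread, and its result is delivered to the event-driven layers as a completion event on a queue. This is a Python illustration of the idea only, not our Java code; `blocking_read`, `issue_read`, and the pool size are hypothetical.

```python
import queue
from concurrent.futures import ThreadPoolExecutor

# Completion events: (request id, result) pairs consumed by the event-driven layer.
completions = queue.Queue()


def blocking_read(request_id: str) -> bytes:
    """Stand-in for a blocking I/O call (hypothetical; returns canned data)."""
    return f"data for {request_id}".encode()


def issue_read(pool: ThreadPoolExecutor, request_id: str) -> None:
    """Start the blocking call on a pool thread and enqueue its result as an event."""
    future = pool.submit(blocking_read, request_id)
    future.add_done_callback(lambda f: completions.put((request_id, f.result())))


with ThreadPoolExecutor(max_workers=4) as pool:
    for rid in ("r1", "r2", "r3"):
        issue_read(pool, rid)

# Leaving the `with` block waits for all submitted work, so every completion is queued.
results = dict(completions.get() for _ in range(3))
print(sorted(results))
```

The number of busy threads tracks the number of outstanding I/O requests rather than the number of tasks, which is the property described above; the cost is one thread per request in flight.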


7.2 Future Work 


We plan on investigating more interesting data- 
parallel operations on a DDS (such as an iterator, 
or the Lisp maplist() operator). We also plan on 
building other distributed data structures, includ- 
ing a B-tree and an administrative log. In doing 
so, we hope to reuse many of the components of 
the hash table, such as the brick storage layer, the 
RG map infrastructure, and the two-phase commit 
code. We would like to explore caching in the DDS 
libraries (we currently rely on services to build their 
own application-level caches). We are also exploring 
adding other single-element operations to the hash 
table, such as testandset(), in order to provide 
locks and leases to services that may have many ser- 
vice instances competing to write to the same hash 
table element. 
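How a test-and-set style operation could provide a lock is sketched below. The semantics shown (store a new value only if the current value equals an expected one) are an assumption, since the operation is future work and its interface is not yet fixed; the class and method names are hypothetical, and a local mutex stands in for whatever atomicity the hash table would provide.

```python
import threading


class ElementStore:
    """In-memory stand-in for a hash table with an atomic test-and-set."""

    def __init__(self) -> None:
        self._data = {}
        self._mutex = threading.Lock()

    def test_and_set(self, key, expected, new_value) -> bool:
        """Store new_value only if the current value equals expected."""
        with self._mutex:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new_value
            return True


store = ElementStore()
# Two service instances race for the same lock element; None means "unlocked".
first = store.test_and_set("lock:user42", None, "instance-A")
second = store.test_and_set("lock:user42", None, "instance-B")
released = store.test_and_set("lock:user42", "instance-A", None)
print(first, second, released)
```

Only the first instance acquires the lock, and only the holder can release it. A lease would additionally need an expiry time stored with the owner, which this sketch omits.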


8 Related Work 


Litwin et al.’s scalable, distributed data struc- 
tures (SDDS) such as RP* [22, 26] helped to mo- 
tivate our own work. RP* focuses on algorithmic 
properties, while we focused on the systems issues 
of implementing an SDDS that satisfies the concur- 
rency, availability, and incremental scalability needs 
of Internet services. 

Our work has a great deal in common with 
database research. The problems of partitioning
and replicating data across shared-nothing multicomputers
have been studied extensively in the distributed
and parallel database communities [10, 17,
25]. We use mechanisms such as horizontal parti- 
tioning and two-phase commits, but we do not need 
an SQL parser or a query optimization layer since 
we have no general-purpose queries in our system. 

We also have much in common with distributed 
and parallel file systems [3, 23, 31, 33]. A DDS 
presents a higher-level interface than a typical file 
system, and DDS operations are data-structure spe- 
cific and atomically affect entire elements. Our re- 
search has focused on scalability, availability, and 
consistency under high throughput, highly concurrent
traffic, which is a different focus than file systems.
Our work is most similar to Petal [24], in that
a Petal distributed virtual disk can be thought of as 
a simple hash table with fixed sized elements. Our 
hash tables have variable sized elements, an addi- 
tional name space (the set of hash tables), and fo- 
cus on Internet service workloads and properties as 
opposed to file system workloads and properties. 

The CMU network attached secure disk 
(NASD) architecture [11] explores variable-sized ob- 
ject interfaces as an abstraction to allow storage sub- 
systems to optimize disk layout. This is similar to 
our own data structure interface, which is deliber- 
ately higher-level than the block or file interfaces of 
Petal and parallel or distributed file systems. 

Distributed object stores [13] attempt to transparently
add persistence to distributed object
systems. The persistence of (typed) objects is typically
determined by reachability through the transitive
closure of object references, and the removal of
objects is handled by garbage collection. A DDS has 
no notion of pointers or object typing, and applica- 
tions must explicitly use API operations to store and 
retrieve elements from a DDS. Distributed object 
stores are often built with the wide-area in mind, 
and thus do not focus on the scalability, availability, 
and high throughput requirements of cluster-based 
Internet services. 

Many projects have explored the use of clusters 
of workstations as a general-purpose platform for 
building Internet services [1, 4, 15]. To date, these
platforms rely on file systems or databases for per- 
sistent state management; our DDS’s are meant to 
augment such platforms with a state management 
platform that is better suited to the needs of Inter- 
net services. The Porcupine project [30] includes a 
storage platform built specifically for the needs of 
a cluster-based scalable mail server, but they are 
attempting to generalize their storage platform for 
arbitrary service construction. 

There have been many projects that explored
wide-area replicated, distributed services [9, 27]. 
Unlike clusters, wide-area systems must deal with 
heterogeneity, network partitions, untrusted peers, 
high latency and low throughput networks, and mul- 
tiple administrative domains. Because of these dif- 
ferences, wide-area distributed systems tend to have 
relaxed consistency semantics and low update rates. 
However, if designed correctly, they can scale up 
enormously. 


9 Conclusions 


This paper presents a new persistent data management
layer that enhances the ability of clusters to
support Internet services. This self-managing layer,
called a distributed data structure (DDS), fills in an 
important gap in current cluster platforms by pro- 
viding a data storage platform specifically tuned for 
services’ workloads and for the cluster environment. 

This paper focused on the design and implemen- 
tation of a distributed hash table DDS, empirically 
demonstrating that it has many properties necessary 
for Internet services (incremental scaling of through- 
put and data capacity, fault tolerance and high avail- 
ability, high concurrency, and consistency and dura- 
bility of data). These properties were achieved by 
carefully designing the partitioning, replication, and 
recovery techniques in the hash table implementa- 
tion to exploit features of cluster environments (such 
as a low-latency network with a lack of network par- 
titions). By doing so, we have “right-sized” the DDS 
to the problem of persistent data management for 
Internet services. 

The hash table DDS simplifies Internet service
construction by decoupling service-specific logic
from the complexities of persistent state manage- 
ment, and by allowing services to inherit the nec- 
essary service properties from the DDS rather than 
having to implement the properties themselves. 
Abstract


Single-language runtime systems, in the form of Java 
virtual machines, are widely deployed platforms for ex- 
ecuting untrusted mobile code. These runtimes pro- 
vide some of the features that operating systems pro- 
vide: inter-application memory protection and basic sys- 
tem services. They do not, however, provide the ability 
to isolate applications from each other, or limit their re- 
source consumption. This paper describes KaffeOS, a 
Java runtime system that provides these features. The 
KaffeOS architecture takes many lessons from operating 
system design, such as the use of a user/kernel bound- 
ary, and employs garbage collection techniques, such as 
write barriers. 

The KaffeOS architecture supports the OS abstraction 
of a process in a Java virtual machine. Each process exe- 
cutes as if it were run in its own virtual machine, includ- 
ing separate garbage collection of its own heap. The dif- 
ficulty in designing KaffeOS lay in balancing the goals 
of isolation and resource management against the goal of 
allowing direct sharing of objects. Overall, KaffeOS is 
no more than 11% slower than the freely available JVM 
on which it is based, which is an acceptable penalty for 
the safety that it provides. Because of its implementation 
base, KaffeOS is substantially slower than commercial 
JVMs for trusted code, but it clearly outperforms those 
JVMs in the presence of denial-of-service attacks or
misbehaving code.


1 Introduction 


The need to support the safe execution of untrusted 
programs in runtime systems for type-safe languages has 
become clear. Language runtimes are being used in 
many environments for executing untrusted code: for
example, applets, servlets, active packets [41], database 
queries [15], and kernel extensions [6]. Current systems 
(such as Java) provide memory protection through the 
enforcement of type safety and secure system services 
through a number of mechanisms, including namespace 
and access control. Unfortunately, malicious or buggy 
applications can deny service to other applications. For 
example, a Java applet can generate excessive amounts 
of garbage and cause a Web browser to spend all of its 
time collecting it. 

To support the execution of untrusted code, type-safe 
language runtimes need to provide a mechanism to iso- 
late and manage the resources of applications, analogous 
to that provided by operating systems. Although other re- 
source management abstractions exist [4], the classic OS 
process abstraction is appropriate. A process is the basic 
unit of resource ownership and control; it provides iso- 
lation between applications. On a traditional operating 
system, untrusted code can be forked in its own process; 
CPU and memory limits can be placed on the process; 
and the process can be killed if it is uncooperative. 

A number of approaches to isolating applications in 
Java have been developed by others over the last few 
years. An applet context [9] is an example of an 
application-specific approach. It provides a separate 
namespace and a separate set of execution permissions 
for untrusted applets. Applet contexts do not support re- 
source management, and cannot defend against denial- 
of-service attacks. In addition, they are not general: ap- 
plet contexts are specific to applets, and cannot be used 
easily in other environments. 

Several general-purpose models for isolating applications
in Java do exist, such as the J-Kernel [23] or
Echidna [21]. However, these solutions superimpose 
an operating system kernel abstraction on Java without 
changing the underlying virtual machine. As a result, 
it is impossible in those systems to account for resources 
spent on behalf of a given application: for example, CPU 
time spent while garbage collecting a process’s heap. 

An alternative approach to separate different applica- 
tions is to give each one its own virtual machine, and 
run each virtual machine in a different process on an
underlying OS [25, 29]. For instance, most operating
systems can limit a process’s heap size or CPU
consumption. Such mechanisms could be used to directly limit an
entire VM’s resource consumption, but they depend on 
underlying operating system support. 

Designing JVMs to support multiple processes is a su- 
perior approach. First, it reduces per-application over- 
head. For example, applications on KaffeOS can share 
classes in the same way that an OS allows applications 
to share libraries. Second, communication between pro- 
cesses can be more efficient in one VM, since objects can 
be shared directly. (One of the reasons for using type- 
safe language technology in systems such as SPIN [6] 
was to reduce the cost of IPC; we want to keep that goal.) 
Third, embedding a JVM in another application, such as 
a web server or web browser, is difficult (or impossible) 
if the JVM relies on an operating system to isolate dif- 
ferent activities. Fourth, embedded or portable devices 
may not provide OS or hardware support for managing 
processes. Finally, a single JVM uses less energy than 
multiple JVMs on portable devices [19].

Our work consists of supporting processes in a modern 
type-safe language, Java. Our solution, KaffeOS, adds a 
process model to Java that allows a JVM to run multiple 
untrusted programs safely, and still supports the direct 
sharing of resources between programs. The difficulty 
in designing KaffeOS lay in balancing conflicting goals: 
process isolation and resource management versus direct 
sharing of objects between processes. 

A KaffeOS process is a general-purpose mechanism 
that can easily be used in multiple application domains. 
For instance, KaffeOS could be used in a browser to sup- 
port multiple applets, within a server to support multiple 
servlets, or even to provide a standalone “Java OS” on 
bare hardware. We have structured our abstractions and 
APIs so that they are as broadly applicable as possible, 
much as the OS process abstraction is. Because the Kaf- 
feOS architecture is designed to support processes, we 
have taken lessons from the design of traditional operat- 
ing systems, such as the use of a user/kernel boundary. 

Our design makes KaffeOS’s isolation and resource 
control mechanisms comprehensive. We focus on the 
management of CPU time and memory, although we plan
to address other resources such as network bandwidth. 
The runtime system is able to account for and control 
all of the CPU and memory resources consumed on be- 
half of any process. We have dealt with these issues by 
structuring the KaffeOS virtual machine so that it sepa- 
rates the resources used by different processes as much 
as possible. 

To summarize, this paper makes the following contri- 
butions: 
• We describe how lessons from building traditional
  operating systems can and should be used to structure
  runtime systems for type-safe languages.

• We describe how software mechanisms in the compiler
  and runtime can be used to implement isolation
  and resource management in a Java virtual machine.

• We describe the design and implementation of KaffeOS.
  KaffeOS implements our process model in
  Java, which isolates applications from each other,
  provides resource management mechanisms for
  them, and also lets them share resources directly.

• We show that the performance penalty for using
  KaffeOS is reasonable, compared to the freely available
  JVM on which it is based. Even though, due to
  that implementation base, KaffeOS is substantially
  slower than commercial JVMs on standard benchmarks,
  it outperforms those JVMs in the presence
  of uncooperative code.


Sections 2 and 3 describe and discuss the design and 
implementation of KaffeOS. Section 4 provides some 
performance measurements of KaffeOS, and compares 
its performance with that of some commercial Java vir- 
tual machines. Section 5 describes related work in more 
detail, and Section 6 summarizes our conclusions and re- 
sults. 


2 Design Principles 


The following principles drove our design of KaffeOS, 
in decreasing order of importance: 


• Process separation. We provide the “classical”
  property of a process: each process is given the illusion
  of having the whole virtual machine to itself.

• Safe termination of processes. Processes may terminate
  abruptly due to either an internal error or an
  external event. In both cases, we ensure that the integrity
  of other processes and the system itself is not
  violated.

• Direct sharing between processes. Processes can
  directly share objects in order to communicate with
  each other.

• Precise memory and CPU accounting. The memory
  and CPU time spent on almost all activities can
  be attributed to the application on whose behalf it
  was expended.

• Full reclamation of memory. When a process is
  terminated, its memory must be fully reclaimed.
  In a language-based system, memory cannot be revoked
  by unmapping pages: it must be garbage-collected.
  We restrict a process’s heap writes to
  avoid uncollectable memory in the presence of direct
  object sharing.

• Hierarchical memory management. Memory allocation
  can be managed in a hierarchy, which provides
  a simple model for controlling processes.
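To make the last principle concrete, hierarchical memory management can be pictured as a tree of budgets in which every allocation is charged to a node and to all of its ancestors. The sketch below is an illustration under assumptions, not the KaffeOS implementation: the class `MemLimit`, its methods, and the byte counts are hypothetical.

```python
from typing import Optional


class MemLimit:
    """Memory budget node; every allocation is also charged to all ancestors."""

    def __init__(self, limit: int, parent: Optional["MemLimit"] = None) -> None:
        self.limit = limit
        self.used = 0
        self.parent = parent

    def _chain(self):
        """Yield this node followed by each ancestor up to the root."""
        node = self
        while node is not None:
            yield node
            node = node.parent

    def allocate(self, size: int) -> bool:
        """Charge size bytes here and in every ancestor, or charge nothing."""
        if any(n.used + size > n.limit for n in self._chain()):
            return False
        for n in self._chain():
            n.used += size
        return True

    def release_all(self) -> None:
        """Return this node's whole usage to its ancestors (e.g. on termination)."""
        for n in list(self._chain())[1:]:
            n.used -= self.used
        self.used = 0


root = MemLimit(limit=100)
a = MemLimit(limit=80, parent=root)
b = MemLimit(limit=80, parent=root)
before = (a.allocate(60), b.allocate(60), b.allocate(40))
a.release_all()
after = (b.allocate(40), root.used)
print(before, after)
```

Here the second allocation fails because the shared parent budget would be exceeded even though the child's own budget has room; once the first child releases its memory, the sibling can grow again.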


The interaction between these design principles is com- 
plex. For expository purposes, we discuss these princi- 
ples in a slightly different order in the remainder of this 
section. 


Process separation. A process cannot accidentally or
intentionally access another process’s data, because each
process has its own heap. A heap consists of a memory
pool managed by an allocator and a garbage collector.
Each process is given its own name space for its
objects and classes, as well. Type safety provides memory
protection, so that a process cannot access other
processes’ objects.

To ensure process separation, an untrusted process is 
not allowed to hold onto system-level resources indefi- 
nitely. For instance, global kernel locks are not directly 
accessible to user processes. Violations of this restriction 
are instances of bad system design. Similarly, faults in 
one process must not impact progress in other processes. 


Safe termination of processes. KaffeOS is structured 
such that critical parts of the system cannot be damaged 
when a process is terminated. For example, a process is 
not allowed to terminate when it is holding a lock on a
system resource. 

We divide KaffeOS into user and kernel parts [2], an 
important distinction used in operating system design. A 
user/kernel distinction is necessary to maintain system 
integrity in the presence of process termination. 

Figure 1 illustrates the high-level structure of Kaf- 
feOS. User code executes in “user mode,” as do some 
of the trusted runtime libraries and some of the garbage 
collection code. The remaining parts of the system (the 
rest of the runtime libraries and the garbage collector, 
as well as the virtual machine itself) must run in kernel 
mode to ensure their integrity. Note that “user mode” 
and “kernel mode” do not indicate a change in hardware
privileges. Instead, they indicate different environments 
with respect to termination and resource consumption: 





[Figure 1 diagram. Labels: User code (untrusted); Runtime Libraries (trusted); Kernel code (trusted); User mode; Kernel mode.]





Figure 1: Structure of KaffeOS. System code is divided into 
kernel and user modes; user code all runs in user mode. In user 
mode, code can be terminated arbitrarily; in kernel mode, code 
cannot be terminated arbitrarily. 


• Processes running in user mode can be terminated
at will. Processes running in kernel mode cannot be 
terminated at an arbitrary time, because they must 
leave the kernel in a clean state. 


• Resources consumed in user mode are always
charged to a user process, and not to the system as 
a whole. Only in kernel mode can a process con- 
sume resources that are charged to the entire sys- 
tem, although typically such use is charged to the 
appropriate user process. 


Such a structure echoes that of exokernels [18], where 
system-level code executes as a user-mode library. Note 
that a language-based system allows the kernel to trust 
user-mode code to a great extent, because type safety 
prevents user code from damaging any user-mode sys- 
tem code. 

The KaffeOS kernel is structured so that it can han- 
dle termination requests and internal errors cleanly. Ter- 
mination requests are deferred, so that a process cannot 
be terminated while manipulating kernel data structures. 
Kernel code must not abruptly terminate due to internal 
exceptions, for the same reason. Violations of these two 
restrictions are considered kernel bugs. 
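
The deferral rule described above can be illustrated with a minimal sketch. The class and method names (Process, enter_kernel, leave_kernel, request_termination) are hypothetical and are not taken from the KaffeOS sources; the sketch only models the rule that a termination request arriving in kernel mode takes effect once the process has left the kernel.

```python
class Process:
    """Toy model of deferred termination (hypothetical names, not KaffeOS code)."""

    def __init__(self):
        self.kernel_depth = 0      # > 0 while executing kernel-mode code
        self.kill_pending = False  # a termination request arrived in kernel mode
        self.alive = True

    def enter_kernel(self):
        self.kernel_depth += 1

    def leave_kernel(self):
        self.kernel_depth -= 1
        # A deferred request is honored once the kernel is back in a clean state.
        if self.kernel_depth == 0 and self.kill_pending:
            self.alive = False

    def request_termination(self):
        if self.kernel_depth > 0:
            self.kill_pending = True   # defer: kernel data may be mid-update
        else:
            self.alive = False         # user mode: terminate at once


p = Process()
p.enter_kernel()
p.request_termination()
deferred = p.alive          # still alive while in kernel mode
p.leave_kernel()
terminated = not p.alive    # the request takes effect on leaving the kernel
```

A nesting counter rather than a boolean is used here so that nested kernel entries do not release a pending request too early; whether KaffeOS tracks nesting this way is an assumption.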

Others have suggested that depending on language- 
level exception handling is sufficient for safe termina- 
tion. We disagree, because exceptions interact poorly 
with code in critical sections, which leaves shared data 
structures open to corruption. Even if termination re- 
quests were deferred during critical sections, one would 
need transactional support to ensure the integrity of mu- 
tually related data structures in the absence of a kernel. 
In addition, such an approach would be a confusing overloading
of the concepts of mutual exclusion and deferred
termination; preventing termination while any lock is 
held would also violate isolation. 

Full reclamation of memory. Since Java is type-safe,
it does not provide a primitive to reclaim memory. In- 
stead, unreachable memory is freed by a garbage collec- 
tor. We use the garbage collector to recover all the mem- 
ory of a process when it terminates. Therefore, we must 
prevent situations where the collector cannot free a ter- 
minated process’s objects because another process still 
holds references to them. 

We use techniques from distributed garbage collection 
schemes [31] to restrict cross-process references. Dis- 
tributed GC mechanisms are normally used to overcome 
the physical separation of machines and create the im- 
pression of a global shared heap. We use distributed GC 
mechanisms to manage multiple heaps in a single address 
space, so that they can be collected independently. 

We use write barriers [43] to restrict writes. A write 
barrier is a check that happens on every pointer write to 
the heap. As we show in Section 4, the cost of using 
write barriers, although non-negligible, is reasonable. 

Illegal cross-references are those that would prevent a 
process’s memory from being reclaimed: for example, 
references from one user heap to another. Since those 
references cannot exist, it is possible to reclaim a pro- 
cess’s heap as soon as the process is terminated. Writes 
that would create illegal cross-references are forbidden, 
and raise exceptions. We call such exceptions “segmen- 
tation violations.” Although it may seem surprising that 
a type-safe language runtime could throw such a fault, it
actually follows the analogy to traditional operating sys- 
tems closely. 

Unlike distributed garbage collection, in KaffeOS 
inter-heap cycles do not cause problems. The only
inter-heap cycles that can occur are due to data structures
that are split between a user heap and the kernel
heap, since there can be no cycles that span multiple user 
heaps. Writes of user-heap references to kernel objects 
can only be done by trusted code. The kernel is coded 
so that it only writes a user-heap reference to a kernel 
object whose lifetime equals that of the user process: for 
example, the object that represents the process itself. 

KaffeOS is intended to run on a wide range of systems. 
We assume that the platforms on which it runs will not 
necessarily have a hardware memory management unit 
under the control of KaffeOS. We also assume that the 
host may not have an operating system that supports vir- 
tual memory. For example, a Palm Pilot satisfies both 
of these assumptions. Under these assumptions, memory 
cannot simply be revoked by unmapping it. 


Precise memory and CPU accounting. We account 
for memory and CPU on a per-process basis, so as to 
limit their consumption by buggy or possibly malicious 
code. In addition, to prevent denial-of-service attacks, it
is necessary to minimize the amount of time and memory 
spent servicing kernel requests. 


Memory accounting is complete. It applies not only to 
objects at the Java level, but to all allocations done in the 
VM on behalf of a given process. In contrast, bytecode-rewriting
approaches that do not modify the virtual machine,
such as JRes [13, 14], can only account for object
allocations. 


We try to minimize the number of objects that are al- 
located on the kernel heap through careful coding of the 
kernel interfaces. For instance, consider a system call 
that creates a new process with a new heap: the process 
object itself, which is large, is allocated on the new heap. 
The handle that is returned to the creating process to con- 
trol the new process is allocated on the creating process’s 
heap. The kernel heap only maintains a small entry in a 
process table. 


We increase the accuracy of CPU accounting by min- 
imizing the time spent in non-preemptible sections of 
code. In addition, separately collecting user heaps and 
the kernel heap reduces the amount of time spent in the 
kernel. We again use write barriers: here, to detect cross- 
references from a user to the kernel heap, and vice versa. 
For each such reference, we create an entry item in the 
heap to which it points [31]. In addition, we create a 
special exit item in the original heap to remember the 
entry item created in the destination heap. Unlike dis- 
tributed object systems such as Emerald [26], entry and 
exit items are not used for naming non-local objects; we 
only use them to decouple the garbage collection of dif- 
ferent heaps. 


Entry items are reference counted: they keep track of 
the number of exit items that point to them. The ref- 
erence count of an entry item is decremented when an 
exit item is garbage collected. If an entry item’s refer- 
ence count reaches zero, the entry item is removed, and 
the referenced object can be garbage collected if it is not 
reachable through some other path. 
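
The bookkeeping for entry and exit items can be sketched as follows. This is an illustration under stated assumptions, not the KaffeOS data structures: the names Heap, EntryItem, add_cross_ref, and collect_exit are hypothetical, and objects are keyed by Python's id() purely for brevity.

```python
class EntryItem:
    """Lives in the heap that holds the referenced object."""

    def __init__(self, target):
        self.target = target
        self.refcount = 0   # number of exit items pointing at this entry


class Heap:
    def __init__(self, name):
        self.name = name
        self.entries = {}   # id(target) -> EntryItem; treated as GC roots
        self.exits = {}     # id(target) -> EntryItem in the destination heap

    def add_cross_ref(self, dest, target):
        """Record a reference from this heap to `target`, which lives in `dest`."""
        key = id(target)
        if key in self.exits:
            return  # one exit item per (heap, target) pair
        entry = dest.entries.setdefault(key, EntryItem(target))
        entry.refcount += 1
        self.exits[key] = entry

    def collect_exit(self, dest, target):
        """Called when the local collector finds the exit item unreachable."""
        entry = self.exits.pop(id(target))
        entry.refcount -= 1
        if entry.refcount == 0:
            del dest.entries[id(target)]  # target is no longer an external root


kernel, user_a, user_b = Heap("kernel"), Heap("A"), Heap("B")
obj = object()                       # stands for an object on the kernel heap
user_a.add_cross_ref(kernel, obj)
user_b.add_cross_ref(kernel, obj)
count_after_two = kernel.entries[id(obj)].refcount
user_a.collect_exit(kernel, obj)
user_b.collect_exit(kernel, obj)
entry_removed = id(obj) not in kernel.entries
```

Once the entry item is gone, the kernel heap's collector is free to reclaim the object unless it is still reachable locally.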


A process’s memory is reclaimed upon termination by 
merging its heap with the kernel heap. All exit items are 
destroyed at this point and the corresponding entry items 
are updated. The kernel heap’s collector can then collect 
all of the memory, including memory on the kernel heap 
that was kept alive by the process. User-kernel cycles 
of garbage objects can be collected at this time. Note 
that a user process could attempt to create and kill a
large number of new heaps to deny service to other processes.
Such an attempt can only be prevented by imposing
additional restrictions on the number of kernel service
invocations a process may make, or on their frequency.

Figure 2: Heap structure in KaffeOS. The kernel heap can
contain pointers into the user heaps, but the shared heaps and 
other user heaps cannot. User heaps can contain pointers into 
the kernel heap and shared heaps. 


Direct sharing between processes. One of the reasons 
for using a language-based system is to allow for di- 
rect communication between applications. For example, 
the SPIN operating system allowed kernel extensions to 
communicate directly through pointers to memory. The 
design of KaffeOS retains this design principle. Figure 2 
shows the different heaps in KaffeOS, and the kinds of 
inter-heap pointers that are legal. 

In KaffeOS, a process can dynamically create a shared 
heap to communicate with other processes. A shared 
heap holds ordinary objects that can be accessed in the 
usual manner. Shared objects are not allowed to have 
pointers to objects on any user heap, because those point- 
ers would prevent this user heap’s full reclamation. This 
restriction is again enforced by write barriers; attempts 
to assign such pointers will result in an exception. 

A shared heap has the following lifecycle. First, one 
process picks one or more shared types out of a central 
shared namespace, creates the heap, and loads the shared 
class or classes into it. While the heap is being created, 
the creator is charged for the whole heap. After the heap 
is populated with classes and objects, it is frozen and its 
size remains fixed for its lifetime. If other processes look 
up the shared heap, they are charged that amount. In this 
way, all sharers are charged for the heap. Processes ex- 
change data by writing into and reading from the shared 
objects and by synchronizing on them in the usual way. 

If a process drops all references to a shared heap, all 
exit items to that shared heap become unreachable. After
the process garbage collects the last exit item to a
shared heap, that shared heap’s memory is credited to the 
sharer’s budget. When the last sharer drops all references 
to a shared heap, the shared heap becomes orphaned. 
The kernel garbage collector checks for orphaned shared 
heaps at the beginning of each GC cycle and merges them 
into the kernel heap.

This model guarantees three properties: 


• All sharers are charged in full for a shared heap
while they are holding onto the shared heap, whose
size is fixed. As a result, sharers do not have to
be charged asynchronously if another sharer exits.
(If n sharers were each to pay only 1/n of the
cost of a shared heap, when one sharer exited the
others would have to be asynchronously charged
1/(n − 1) − 1/n of the cost.)


• As already discussed, one process cannot use a
shared object to keep objects in another process 
alive. 


• Sharers are charged accurately for all metadata,
such as internal class data structures. The metadata 
is also allocated on the shared heap. Unfortunately, 
this prevents us from applying any optimization that 
allocates data structures related to the shared heap 
lazily during execution. 
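
The parenthetical arithmetic in the first property can be checked with exact fractions. The sketch below is only a worked example of the rejected 1/n scheme; the function name and the choice of n = 4 are illustrative.

```python
from fractions import Fraction


def extra_charge_on_exit(n):
    """Extra share each remaining sharer owes when one of n equal sharers exits."""
    return Fraction(1, n - 1) - Fraction(1, n)


n = 4
delta = extra_charge_on_exit(n)                    # 1/3 - 1/4 of the heap's cost
total_after = (n - 1) * (Fraction(1, n) + delta)   # what the remaining sharers pay in total
```

With four sharers, each of the three that remain would owe an additional 1/12 of the heap, and together they again cover the whole heap. Charging every sharer in full avoids this asynchronous adjustment.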


Although process heaps can be scanned independently 
during GC, thread stacks still need to be scanned dur- 
ing GC for inter-heap references. Incremental schemes 
could be used to eliminate repeated scans of a stack [12], 
and a thread does not need to be scanned more than once 
while it is suspended. Some “GC crosstalk” between 
processes is still possible, because a process could cre- 
ate many threads in an effort to get the system to scan 
them all. We decided that the benefit of allowing direct 
sharing between processes is worth leaving open such a 
possibility. 


Hierarchical memory management. We provide a 
simple hierarchical model for managing memory. Each 
heap is associated with a memlimit, which consists of an 
upper limit and a current use. Memlimits form a hierar- 
chy: each one has a parent, except for a root memlimit. 
All memory allocated to the heap is debited from that 
memlimit, and memory collected from that heap is cred- 
ited to the memlimit. This process of crediting/debiting 
is applied recursively to the node’s parents. 

A memlimit can be hard or soft. This attribute influ- 
ences how credits and debits percolate up the hierarchy 
of memlimits. A hard memlimit’s maximum limit is immediately
debited from its parent, which amounts to setting
memory aside. Credits and debits are therefore not
propagated past a hard limit. A soft memlimit’s maxi- 
mum limit, on the other hand, is just a limit—credits and 
debits of a soft memlimit’s current usage are reflected in 
the parent. 
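
The difference between hard and soft memlimits can be made concrete with a small sketch. The class below is an assumption-laden model rather than the KaffeOS implementation: the names Memlimit, debit, credit, and MemlimitExceeded are hypothetical, and failing a debit with an exception is one possible policy among several.

```python
class MemlimitExceeded(Exception):
    pass


class Memlimit:
    def __init__(self, limit, parent=None, hard=False):
        self.limit, self.parent, self.hard = limit, parent, hard
        self.used = 0
        if hard and parent is not None:
            parent.debit(limit)          # reserve the whole limit up front

    def _chain(self):
        """This node plus the ancestors that see its debits and credits."""
        node = self
        while node is not None:
            yield node
            if node.hard:
                break                    # nothing propagates past a hard limit
            node = node.parent

    def debit(self, n):
        chain = list(self._chain())
        if any(m.used + n > m.limit for m in chain):
            raise MemlimitExceeded(n)    # checked before any counter changes
        for m in chain:
            m.used += n

    def credit(self, n):
        for m in self._chain():
            m.used -= n


root = Memlimit(100, hard=True)
reserved = Memlimit(40, parent=root, hard=True)   # root.used becomes 40 at once
soft = Memlimit(50, parent=root)                  # no reservation
soft.debit(30)                                    # reflected in root
reserved.debit(10)                                # stays inside the reservation
try:
    soft.debit(40)                                # would exceed the soft limit of 50
    over = False
except MemlimitExceeded:
    over = True
```

In this model the hard child costs its parent 40 units whether or not it uses them, which is the inefficiency that soft limits avoid.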

Hard and soft limits allow different memory manage- 
ment strategies. Hard limits allow for memory reserva- 
tions, but incur inefficient memory use if the limits are 
not used. Memory consumption matters, because we do 
not assume there is an underlying operating system; as 
a result, KaffeOS may manage physical memory. Soft 
limits allow the setting of a summary limit for multiple 
activities without incurring the inefficiencies of hard limits.
They can be used to guard against malicious or buggy applications
where temporarily high memory usage can be
tolerated. 

Another application of soft limits is during the creation 
of shared heaps. Shared heaps are initially associated 
with a soft memlimit that is a child of the creating pro- 
cess heap’s memlimit. In this way, they are separately 
accounted but still subject to their creator’s memlimit, 
which ensures that they cannot grow to exceed their cre- 
ator’s ability to pay. 


3 Discussion


The KaffeOS VM is built on top of the freely avail- 
able Kaffe virtual machine, version 1.0b4 [42], which 
is roughly equivalent to JDK 1.1. In this section, we 
describe the specific issues that had to be dealt with in 
implementing KaffeOS. Many implementation decisions 
were driven by our desire to modify the Kaffe codebase 
as little as possible. 

The primary purpose of KaffeOS is to run Java ap- 
plications, which expect a well-defined environment of 
run-time services and libraries. We provide the standard 
Java API within KaffeOS. 

We make use of various features of Java to support 
KaffeOS processes: Java class loaders, in particular, de- 
serve some discussion. We also discuss our use of write 
barriers in more detail. Finally, we discuss some aspects 
of the Kaffe implementation that affect the performance 
that we can achieve with our KaffeOS prototype. 


3.1 Write Barriers 


An attempt to write a pointer to an object into a field 
of another object can have three different outcomes. In 
the common case, if a pointer to an object in the same 
heap is written, nothing needs to happen. If a pointer to 
a foreign heap is written, the write may either be aborted 


and trigger an exception, or it will cause the creation of 
a pair of exit/entry items to keep track of that allowable 
inter-heap reference. 
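
The three outcomes can be summarized in a short sketch. It is a simplified model, not the KaffeOS barrier: all names are hypothetical, the legality test encodes only the rule from Figure 2 that nothing but the kernel heap may point into a user heap, and treating every other cross-heap reference as allowed is an assumption.

```python
class SegmentationViolation(Exception):
    """Raised when a write would create an illegal cross-heap reference."""


KERNEL, USER, SHARED = "kernel", "user", "shared"


class Heap:
    def __init__(self, kind, name):
        self.kind, self.name = kind, name
        self.exit_items = set()   # ids of foreign objects referenced from here
        self.entry_items = {}     # id of local object -> number of exit items


class Obj:
    def __init__(self, heap):
        self.heap = heap
        self.fields = {}


def legal(src, dst):
    """Only the kernel heap may point into a (foreign) user heap."""
    if dst.kind == USER:
        return src.kind == KERNEL
    return True  # assumption: references into kernel and shared heaps are allowed


def write_barrier(holder, field, value):
    src, dst = holder.heap, value.heap
    if src is dst:
        outcome = "same-heap"                 # common case: nothing to do
    elif not legal(src, dst):
        raise SegmentationViolation(f"{src.name} -> {dst.name}")
    else:
        if id(value) not in src.exit_items:   # bookkeeping for independent GC
            src.exit_items.add(id(value))
            dst.entry_items[id(value)] = dst.entry_items.get(id(value), 0) + 1
        outcome = "cross-heap"
    holder.fields[field] = value
    return outcome


a, s = Heap(USER, "A"), Heap(SHARED, "S")
x, y = Obj(a), Obj(a)
r1 = write_barrier(x, "f", y)            # same heap
r2 = write_barrier(x, "g", Obj(s))       # user -> shared: tracked
try:
    write_barrier(Obj(s), "h", x)        # shared -> user: forbidden
    r3 = "allowed"
except SegmentationViolation:
    r3 = "segv"
```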

The option of aborting writes ensures that the separa- 
tion that is necessary for full reclamation is maintained. 
A write barrier exception is either related to a foreign 
user heap, or to a shared heap. If a pointer to a for- 
eign user heap is written, such a pointer must have been 
passed on the stack or in a register as a return value from 
a kernel call. Such write barrier violations indicate kernel 
bugs, since the kernel is not supposed to return foreign 
references to a user process. Write barrier violations on
the shared heap, on the other hand, indicate attempts by 
user code to create a connection from the shared heap to 
the user heap. Such attempts may either be malicious, or 
a sign of a violation of the programming model imposed 
on shared objects. 

Keeping track of entry and exit items ensures the sep- 
aration that is necessary for independent garbage collec- 
tion and the garbage collection of shared heaps. In the 
third case, the write barrier code will maintain entry and 
exit items. As a result, a local garbage collector will 
know to include all incoming references as roots in its 
garbage collection cycle. 

Independent garbage collection, which relies on accu- 
rate bookkeeping of entry and exit items, is important 
in our model. Therefore, write barriers are necessary, 
if only to maintain entry and exit items. This statement 
holds true even if no shared heaps are being used. 

Write barriers could only be optimized away if their 
outcome is known. Such is the case within a procedure 
if static analysis reveals that an assignment is between 
pointers on the same heap (for instance, if newly con- 
structed objects are involved), or that a previous assign- 
ment must have had the same outcome. In addition, if 
a generational collector were used, it should be possi- 
ble to reduce the write barrier penalty by combining the 
code for the generational and the inter-heap write barrier 
checks. 


3.2 Namespaces 


Separate namespaces are provided in Java through the 
use of class loaders [28]. A class loader is an object that
acts as a name server for types, and maps names to types. 
We use the Java class loading mechanism to provide KaffeOS
processes with different namespaces. This use of
Java class loaders is not novel, but is important because 
we have tried to make use of existing Java mechanisms 
when possible. When we use standard Java mechanisms, 
we can easily ensure that we do not violate the language’s
semantics. 

Processes may share types for two reasons: either because
the class in question is part of the run-time library
(i.e., is a system or kernel class), or because it is the 
type of a shared object located on a shared heap, which 
must be identical in the processes that have access to the 
shared heap. We refer to the former as system-shared, 
and the latter as user-shared. Process loaders delegate 
the loading of all shared types to a shared loader. If we 
did not delegate to a single loader, KaffeOS would need 
to support a much more complicated type system for its 
user-shared objects. Using one shared loader makes the 
namespace for user-shared classes global, which requires 
global and prior coordination between communicating 
partners. We use a simple naming convention for this 
shared namespace: the Java package shared.* contains
all user-shared classes.
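
The delegation scheme can be modeled in a few lines. The sketch below is written in Python rather than Java and does not use the real java.lang.ClassLoader API; the loader classes, the SYSTEM_SHARED set, and the class name shared.Message are hypothetical stand-ins chosen for illustration.

```python
class SharedLoader:
    """Single loader for all shared types: one definition per name."""

    def __init__(self):
        self.classes = {}

    def load(self, name):
        return self.classes.setdefault(name, type(name, (), {}))


class ProcessLoader:
    SYSTEM_SHARED = {"java.lang.Object"}   # illustrative subset only

    def __init__(self, shared):
        self.shared = shared
        self.classes = {}                  # this process's private namespace

    def load(self, name):
        if name.startswith("shared.") or name in self.SYSTEM_SHARED:
            return self.shared.load(name)  # delegate all shared types
        # Anything else is defined separately in each process.
        return self.classes.setdefault(name, type(name, (), {}))


shared = SharedLoader()
p1, p2 = ProcessLoader(shared), ProcessLoader(shared)
same_shared = p1.load("shared.Message") is p2.load("shared.Message")
same_private = p1.load("java.io.FileDescriptor") is p2.load("java.io.FileDescriptor")
```

Because both process loaders hand shared names to the same loader, the two processes see an identical type for shared.Message, while each gets its own definition of every other class.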


3.3 Java Class Libraries


To determine which classes can be system-shared, we 
examined each class in the Java standard libraries [10] 
to see how it interacted under the semantics of class 
loading. A class’s members and their associated code 
are described by a sequence of bytes in a class file. 
Classes from identical class files that are loaded [28] 
by different class loaders are defined to be different 
in Java, even though they have identical behavior rela- 
tive to the namespace defined by the loader that loaded 
them. We refer to such classes as reloaded classes. 
Reloaded classes are analogous to traditional shared li- 
braries. Reloading a class gives each instance its own 
copies of static fields. In KaffeOS, Java classes could be 
reloaded; they could be modified to be shared across pro- 
cesses; or they could be used unchanged. For each class, 
we decided which alternative to choose, subject to two 
goals: to share as many classes as possible, but to make 
as few code changes as necessary. 

Certain classes must be shared between processes. For 
example, the java.lang.Object class, which is the 
superclass of all object types, must be shared. If this 
type were not shared, it would not be possible for dif- 
ferent processes to share generic objects! If a system- 
shared class uses static fields, and if these fields can- 
not be eliminated, they must be initialized with objects 
whose implementation is process-aware. Shared classes 
cannot directly refer to reloaded classes, because such 
references are represented using direct pointers by the 
run-time loader. 

Non-shared classes should always be reloaded, so 
that each process gets its own instance. Reloaded 
classes do not share text in our current implementa- 
tion, although they could. Because of some unfortu- 
nate decisions in the Java API design, some classes ex- 
port static members as part of their public interface, 


which forces those classes to be reloaded. For example, 
java.io.FileDescriptor must be reloaded, be- 
cause it exports the public static variables in, out, and 
err (stdin, stdout, and stderr, respectively). Other, pos- 
sibly more efficient, ways to accomplish the same thing 
as reloading exist [16], but their impact on type safety is 
not fully understood. Out of roughly 600 classes in the 
core Java libraries, we are able to safely system-share 
about 430 (72%) of them. The rest of the classes are 
reloaded. 


3.4 Java Language Issues 


A few language compatibility issues arose when
building KaffeOS. For example, the Java language de- 
scription assumes that all string literals are interned, and 
that equality can therefore be checked with a pointer 
comparison (the == operator). Unfortunately, to main- 
tain such semantics, the interned string table would have 
to be a global (kernel) data structure, and user processes
could allocate strings in an effort to make the kernel run 
out of memory. To deal with this problem, we chose to 
separately intern strings for each process. As a result, the 
Java language use of pointer comparison to check string 
equality does not work for strings that were created in 
different heaps, and the equals method must be used 
instead. It is impractical for the JVM to hide this seman- 
tic change from applications. However, this issue arises 
only in rare situations, and then only in KaffeOS-aware 
applications that directly use KaffeOS features. 
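
The effect of per-process interning can be demonstrated with a small model. Python object identity (the is operator) stands in for Java's == on references; JString and ProcessHeap are hypothetical names used only for this sketch.

```python
class JString:
    """Stand-in for a Java string object; identity models Java's ==."""

    def __init__(self, chars):
        self.chars = chars

    def equals(self, other):
        return self.chars == other.chars


class ProcessHeap:
    def __init__(self):
        self.interned = {}   # per-process intern table, charged to this process

    def intern(self, chars):
        return self.interned.setdefault(chars, JString(chars))


p1, p2 = ProcessHeap(), ProcessHeap()
a, b = p1.intern("hello"), p1.intern("hello")
c = p2.intern("hello")
same_process_identical = a is b      # pointer comparison works within a process
cross_process_identical = a is c     # but fails across processes
cross_process_equal = a.equals(c)    # content comparison still succeeds
```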


3.5 Kaffe Limitations 


Kaffe has relatively poor performance compared to 
commercial JVMs, for several reasons. First, its garbage 
collector is relatively primitive: it is a mark-and-sweep 
collector that is neither generational nor incremental. 
Second, it has a simple just-in-time bytecode compiler 
that translates each instruction individually. As a result, 
many unnecessary register spills and reloads are gener- 
ated, and the native code that it produces is relatively 
poor. 


4 Results 


KaffeOS currently runs under Linux on the x86. We 
plan on porting it to the Itsy pocket computer from Com- 
paq WRL; we have already ported Kaffe to the Itsy. To 
demonstrate the effectiveness of KaffeOS, we ran the following
experiments:


• We measured three implementations of the write
barrier. We ran the SPEC JVM98 benchmarks [35] 
on different configurations of KaffeOS, and the ver- 
sion of Kaffe on which it is based, and the IBM 
JVM, which uses one of the fastest commercial JIT
compilers [36] available. We must note that our re- 
sults are not comparable with any published SPEC 
JVM98 metrics, as the measurements are not com- 
pliant with all of SPEC’s run rules. 


• We ran a servlet engine on KaffeOS to demonstrate
that KaffeOS can prevent denial-of-service servlets 
from crashing a server. We also compared how the 
number of KaffeOS processes scales with how the 
number of OS processes scales. 


Our measurements were all taken on an 800 MHz “Katmai”
Pentium III, with 256 Mbytes of SDRAM and a
133 MHz PCI bus, running Red Hat Linux 6.2. The pro- 
cessor has a split 32K L1 cache, and combined 256K L2 
cache. 


4.1 Write Barrier Implementations

To measure the cost of write barriers in KaffeOS, we 
implemented several versions: 


• No Write Barrier. We execute without a write barrier,
and run everything on the kernel heap.


• No Heap Pointer. At each heap pointer write, the
write barrier consists of a call to a routine that finds 
an object’s heap ID by looking at the page on which 
the object lies and performs the barrier checks. In 
order to avoid cache conflict misses, the actual heap 
ID is stored in a block descriptor that is not on the 
same page. This implementation takes 37 cycles 
with a hot cache. 


• Heap Pointer. At each heap pointer write, the write
barrier consists of a call to a routine that finds an ob- 
ject’s heap ID in the object header and performs the 
barrier checks. This implementation takes only 11 
cycles with a hot cache, but adds 4 bytes per object. 


• Fake Heap Pointer. To measure the impact of the
4 bytes of padding in the Heap Pointer implementation,
we use the No Heap Pointer barrier implementation but
add 4 bytes to each object.
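
The two ways of finding an object's heap ID can be contrasted in a short sketch. Everything here is illustrative: the 4 KB page size, the dictionary used as a block descriptor table, and the Header class are assumptions, not details of the KaffeOS allocator.

```python
PAGE_SHIFT = 12                      # assumption: 4 KB pages

# "No Heap Pointer": the heap ID is kept in a per-page block descriptor,
# stored away from the object's own page.
block_heap_id = {}                   # page number -> heap ID


def heap_id_by_page(address):
    return block_heap_id[address >> PAGE_SHIFT]


# "Heap Pointer": the heap ID is stored in each object's header,
# at the cost of extra space per object.
class Header:
    __slots__ = ("heap_id",)

    def __init__(self, heap_id):
        self.heap_id = heap_id


def heap_id_by_header(header):
    return header.heap_id


block_heap_id[0x5000 >> PAGE_SHIFT] = 7
via_page = heap_id_by_page(0x5ABC)          # an object at 0x5ABC lies on page 5
via_header = heap_id_by_header(Header(7))
```

The page-based lookup needs an extra memory access to a different page, which is consistent with its higher cycle count in the text; the header-based lookup trades that access for per-object space.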


The KaffeOS JIT compiler does not yet inline the write 
barrier routine. Inlining the write barrier would not nec- 
essarily improve performance, as it would lead to sub- 
stantial code expansion. 

We ran the SPEC JVM98 benchmark suite on IBM’s 
JVM, on Kaffe00 and on KaffeOS with different imple- 
mentations of the write barrier. Kaffe00 is the code base 
upon which the current version of KaffeOS is built. This 
version is from June 2000. We instrumented Kaffe00 and 
KaffeOS to estimate how many cycles are spent during 


garbage collection. For IBM’s JVM, we used a com- 
mand line switch (-verbosegc) to obtain the number of 
milliseconds spent during garbage collection. 

Figure 3 compares the results of our experiments. 
Each group of bars corresponds to IBM’s JVM, Kaffe00, 
KaffeOS with no write barrier, KaffeOS with no heap 
pointer, KaffeOS with heap pointer, and KaffeOS with 
a fake heap pointer, in that order. The full bar displays 
the benchmark time as displayed by SPEC’s JVM98 out- 
put. The upper part of the bar shows the time spent on 
those garbage collections that occurred during the actual 
benchmark run. Note that we excluded those collections 
that occurred while the SPEC test harness executed. The 
lower part of the bar represents the time not spent during 
garbage collection. 

The time spent during garbage collection depends on 
the initial and maximum heap sizes, the allocation fre- 
quency, and the strategy used to decide when to collect. 
Kaffe00 and KaffeOS use a simple strategy: a collection 
is triggered whenever newly allocated memory exceeds 
125% of the memory in use at the last GC. However, 
while Kaffe00 uses the memory occupied by objects as 
its measure, KaffeOS uses the number of pages as its 
measure, because KaffeOS’s accounting mechanisms are 
designed to take internal fragmentation into account. In 
addition, KaffeOS decides when to collect for each heap 
separately. We do not know what strategy IBM’s JVM 
uses, but its GC performance suggests that it is very ag- 
gressive at keeping its heap small. 
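
The collection trigger shared by Kaffe00 and KaffeOS amounts to a single comparison, sketched below. The function name and parameters are hypothetical; the rule is unit-agnostic, so the same test applies whether memory is counted in bytes occupied by objects (Kaffe00) or in pages (KaffeOS).

```python
def should_collect(allocated_since_gc, in_use_at_last_gc, factor=1.25):
    """True when new allocation exceeds 125% of memory in use at the last GC."""
    return allocated_since_gc > factor * in_use_at_last_gc


trigger = should_collect(1300, 1000)      # above the 1250 threshold
no_trigger = should_collect(1250, 1000)   # exactly at the threshold: no collection
```

In KaffeOS this test would be evaluated separately for each heap, since each heap decides on its own when to collect.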

Overall, IBM’s JVM is between 2 and 5 times faster
than Kaffe00; we will focus on the differences between 
Kaffe00 and the different versions of KaffeOS. While 
Kaffe00 and KaffeOS use different strategies for decid- 
ing when to collect, they use the same conservative non- 
moving collector. For this reason, we will focus on the 
time not spent on garbage collection. 

The difference between Kaffe00 and KaffeOS no write 
barrier (excluding GC time) is minimal, which suggests 
that the changes done to Kaffe’s run-time do not have 
significant performance impact. The difference between 
KaffeOS no write barrier and KaffeOS no heap pointer 
stems from the write barrier overhead, and is consistently 
below 7%. 

Table 1 gives the number of write barriers that are ex- 
ecuted in each of the SPEC benchmarks. When we com- 
pute the time to execute the write barriers by using the 
cycle counts for the barriers, we see that it is a fraction 
of the actual penalty. This discrepancy occurs because 
the microbenchmark uses a hot cache. For most bench- 
marks, the heap pointer optimization is effective in re- 
ducing the write barrier penalty to less than 5%. Exclud- 
ing GC, KaffeOS fake heap pointer performs similarly 

Figure 3: SPEC JVM98 run on various Java platforms. The error bars represent 95% confidence intervals. Each measurement is
the result of three runs using SPEC’s autorun mode. The upper part represents time spent in garbage collection. 

Table 1: Number of write barriers executed for each SPEC
JVM98 benchmark. “Time” is the total CPU cycle cost for the 
write barrier instructions, assuming the No Heap Pointer cost 
of 37 cycles; “percent” is the fraction of the No Write Barrier 
execution time. 


to KaffeOS no heap pointer; however, its overall performance
is lower because more time is spent during GC.

On a better system with a more effective JIT, the relative
cost of using write barriers would increase. On the
other hand, a good JIT compiler could perform several
kinds of optimizations to remove write barriers. A com- 
piler should be able to remove redundant write barriers, 
along the lines of array bounds checking elimination. It 
could even perform method splitting to specialize meth- 
ods, so as to remove useless barriers along frequently 
used call paths. We can only speculate as to what the 
performance penalty for implementing KaffeOS on the 
IBM JVM would be. Nevertheless, as we will show, the
performance of KaffeOS is much better than that of the 
IBM JVM in the presence of uncooperative applications, 
despite the raw performance difference between them. 
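
The “Time” and “percent” columns described in the caption of Table 1 follow from simple arithmetic, which the sketch below reproduces. The 37-cycle cost and the 800 MHz clock come from the text; the barrier count and baseline run time in the example are made-up inputs, not values from the table.

```python
CYCLES_PER_BARRIER = 37      # No Heap Pointer cost quoted in the text
CLOCK_HZ = 800_000_000       # 800 MHz test machine


def barrier_cost(barriers_executed, baseline_seconds):
    """Return (seconds spent in barriers, percent of the no-barrier run time)."""
    cycles = barriers_executed * CYCLES_PER_BARRIER
    seconds = cycles / CLOCK_HZ
    return seconds, 100.0 * seconds / baseline_seconds


# Hypothetical benchmark: 40 million barriers, 100 s without write barriers.
seconds, percent = barrier_cost(barriers_executed=40_000_000, baseline_seconds=100.0)
```

As the text notes, an estimate of this kind assumes a hot cache and therefore understates the penalty observed in the full benchmarks.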


4.2 Servlet Engine 


A Java servlet engine provides an environment for run- 
ning Java programs (servlets) at a server. Their func- 
tionality subsumes that of CGI scripts at Web servers: 
for example, servlets may create dynamic content or run 
database queries. We use a MemHog servlet to measure 
the effects of a denial-of-service attack. MemHog sits in
a loop, repeatedly allocates memory, and keeps it from 
being garbage-collected. 

We compared KaffeOS’s ability to prevent the 
MemHog servlet from denying service with that of 
IBM’s JVM. We used Apache 1.3.12, JServ 1.1 
(Apache’s servlet engine), and a free version of JSDK 
2.0 to run our tests, without modification. JServ runs 
servlets in servlet zones, which are virtual servers. A 
single JServ instance can host one or more servlet zones. 
We ran each JServ in its own KaffeOS process. We com- 
pared KaffeOS against IBM’s JVM, in two configura- 
tions: one servlet zone per JVM (IBM/1), and multiple 
servlet zones in one JVM (IBM/n). Due to time con- 
straints, we used an earlier version of KaffeOS for these 
benchmarks. This version is about half as fast as the ver- 
sion used for the SPEC JVM benchmarks. 

When simulating this denial-of-service attack, we did 
what a system administrator concerned with the availability
of his services would do: we restarted the JVM(s)
and the KaffeOS process, respectively, whenever they 
crashed because of the effects caused by MemHog. In 
KaffeOS, MemHog will cause a single JServ to exit with- 
out affecting other JServs. If each JServ is started in 
its own IBM JVM, the whole JVM will eventually crash 
and be restarted. If all servlets are run in a single JServ 
on a single IBM JVM, the system runs out of memory 
in seemingly random places. The resulting exceptions 
also occurred at random places, including code that 
manipulated data structures shared between servlets 
in the surrounding JServ environment. Eventually, these 
data structures became corrupted, which resulted in an 
unhandled exception in JServ or, in some instances, a 
crash of the entire JVM. 
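The isolation behavior described above can be illustrated with a small C sketch. It is a simulation under stated assumptions, not KaffeOS code: `struct proc`, `proc_charge`, and `run_memhog` are hypothetical names, and memory is only accounted for, never actually allocated. Each process has its own limit, so a MemHog that exceeds its limit is terminated and its account reclaimed without affecting the account of any other process.

```c
#include <stddef.h>

/* Hypothetical per-process memory account (illustrative only). */
struct proc {
    size_t limit;   /* maximum bytes this process may hold */
    size_t used;    /* bytes currently charged to this process */
    int    alive;   /* cleared when the process is terminated */
};

/* Charge n bytes to p. Returns 1 on success. If the charge would exceed
 * the limit, only p is terminated and its account is reclaimed. */
int proc_charge(struct proc *p, size_t n)
{
    if (!p->alive)
        return 0;
    if (n > p->limit - p->used) {
        p->alive = 0;   /* terminate the offending process */
        p->used = 0;    /* reclaim everything it held */
        return 0;
    }
    p->used += n;
    return 1;
}

/* MemHog: keep allocating and never release. Returns the number of
 * charges that succeeded before the process was terminated. */
size_t run_memhog(struct proc *p, size_t chunk)
{
    size_t n = 0;
    if (chunk == 0)
        return 0;       /* a zero-byte charge would never hit the limit */
    while (proc_charge(p, chunk))
        n++;
    return n;
}
```

With an 8MB limit and 1MB chunks, the hog is terminated on its ninth request, while a second process with its own account can keep allocating.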

Figure 4 illustrates the results of our experiments; note 
that the y axis uses a logarithmic scale. Running a sepa- 
rate KaffeOS process for each servlet has consistent per- 
formance, either with a MemHog running or without. 
This graph illustrates the most important feature of Kaf- 
feOS: that it can deliver consistent performance, even in 
the presence of uncooperative or malicious programs. 

The graph shows that running each servlet in its own 
IBM JVM (IBM/1) does not scale. This failure occurs be- 
cause starting multiple JVMs eventually causes the ma- 
chine to thrash. We estimate that each IBM JVM process 
takes about 2MB of virtual memory upon startup. We 
limited each JVM’s heap size to 8MB in this configura- 
tion. An attempt to start 100 IBM JVMs rendered the 
machine inoperable. 

Figure 4: Scaling behavior of JVMs as the number of servlets 
increases. “IBM/1” means one IBM JVM per servlet; “IBM/n” 
means n servlets in one JVM. The “MemHog” measurements 
replace one of the good servlets with a MemHog. The y axis is 
the amount of time for the non-MemHog servlets to correctly 
respond to 1000 client requests. 


If there are no uncooperative servlets running, using 
a single IBM JVM has the best performance. If there 
is a MemHog servlet running, such a configuration has 
worse performance than KaffeOS, despite the fact that 
KaffeOS is several times slower for individual servlets! 
This degradation is caused by a lack of isolation between 
servlets. However, as the ratio of well-behaved servlets 
to malicious servlets increases, the scheduler will yield 
less often to the malicious servlet. Consequently, the 
service of IBM/n,MemHog improves as the number of 
servlets increases. This effect is an artifact of our exper- 
imental setup and cannot be reasonably used to defend 
against denial-of-service attacks. 

Finally, we observe a slight service degradation as the 
number of KaffeOS processes increases. This degra- 
dation is likely due to inefficiencies in the user-mode 
threading system and scheduler. 


5 Related Work 


We classify the related work into three broad cate- 
gories: extensible operating systems, resource manage- 
ment in operating systems, and Java extensions for re- 
source management. 


5.1 Extensible Operating Systems 


Extensible operating systems have existed for many 
years. Most of them were not designed to protect against 
malicious users, although a number of them support 
strong security features. None of them, however, pro- 
vides strong resource controls. Pilot [32] and Cedar [38] 
were two of the earliest language-based systems. Their 
development at Xerox PARC predates a flurry of re- 
search in the 1990’s on such systems. These systems in- 
clude Oberon [44] and Juice [20], which are based on the 
Oberon language; SPIN [6], which is based on Modula- 
3; and Inferno [17], which is based on a language called 
Dis. Such systems can be viewed as single-address-space 
operating systems (see Opal [11]) that use type safety for 
protection. 

VINO is a software-based (but not language-based) 
extensible system [34] that addresses resource manage- 
ment by wrapping kernel extensions within transactions. 
When an extension exceeds its resource limits, it can be 
safely aborted (even if it holds kernel locks) and its re- 
sources can be recovered. Transactions are a very effec- 
tive mechanism, but they are also relatively heavyweight. 


5.2 Resource Management 


Several operating systems projects have focused 
on quality-of-service issues and real-time performance 
guarantees. Nemesis [27] is a single-address-space OS 
that focuses on quality-of-service for multimedia ap- 
plications. Eclipse [8] introduced the concept of a 
reservation domain, which is a pool of guaranteed re- 
sources. Eclipse provides a guarantee of cumulative ser- 
vice, which means that processes execute at a predictable 
rate. It manages CPU, disk, and physical memory. Our 
work is orthogonal, because we examine the software 
mechanisms that are necessary to manage computational 
resources. 

Recent work on resource management has examined 
different forms of abstractions for computational re- 
sources. Banga et al. [4] describe an abstraction called 
resource containers, which are effectively accounts from 
which resource usage can be debited. Resource con- 
tainers are orthogonal to a process’ protection domain: 
a process can contain multiple resource containers, and 
processes can share resource containers. In KaffeOS we 
have concentrated on the basic mechanisms that make 
resource management possible; resource-container-like 
mechanisms could be added in the future. 
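As a rough illustration of the resource-container idea, the C sketch below models a container as an account with a balance, and a process as a holder of references to one or more containers. The names `struct rcontainer`, `struct process`, and `rc_debit` are assumptions made for this example; they are not taken from Banga et al. or from KaffeOS.

```c
#include <stddef.h>

/* Hypothetical resource container: an account that usage is debited from. */
struct rcontainer {
    long balance;                 /* remaining units, e.g. bytes or CPU ticks */
};

#define MAX_CONTAINERS 4

/* A process refers to several containers; two processes may refer to the
 * same container, which is how sharing is modeled here. */
struct process {
    struct rcontainer *containers[MAX_CONTAINERS];
    size_t ncontainers;
};

/* Debit `amount` from container `idx` of process `p`.
 * Returns 0 on success, -1 on a bad index, a negative amount, or an
 * insufficient balance (in which case nothing is debited). */
int rc_debit(struct process *p, size_t idx, long amount)
{
    struct rcontainer *c;

    if (idx >= p->ncontainers || amount < 0)
        return -1;
    c = p->containers[idx];
    if (c->balance < amount)
        return -1;
    c->balance -= amount;
    return 0;
}
```

Because the container is separate from the process, a debit made through one process is visible to every other process that shares the same container.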


5.3 Java Extensions 


Besides KaffeOS, a number of other research systems 
have explored (or are currently exploring) the problem of 
supporting processes in Java. 

The J-Kernel [23] and JRes [13, 14] projects at Cornell 
explore resource control issues without making changes 
to the Java virtual machine. The J-Kernel extends Java 
by supporting capabilities between processes. These ca- 
pabilities are indirection objects that can be used to iso- 
late processes from each other. JRes extends the J-Kernel 
with a resource management interface whose implemen- 
tation is portable across JVMs. The disadvantage of JRes 
(as compared to KaffeOS) is that JRes is a layer on top of 
a JVM; therefore, it cannot account for JVM resources 
consumed on the behalf of applications. Cornell is also 
exploring type systems that can support revocation di- 
rectly [24]. 

Alta [39, 40] is a Java virtual machine that enforces 
resource controls based on a nested process model. The 
nested process model in Alta allows processes to con- 
trol the resources and environment of other processes, in- 
cluding the class namespace. Additionally, Alta supports 
a more flexible sharing model that allows processes to 
directly share more than just objects of primitive types. 
Like KaffeOS, Alta is based on Kaffe, and, like KaffeOS, 
Alta provides support within the JVM for comprehensive 
memory accounting. However, Alta only provides a sin- 
gle, global garbage collector, so separation of garbage 
collection costs is not possible. For a more thorough dis- 
cussion of Alta and the J-Kernel, see Back et al. [1]. 

Balfanz and Gong [3] describe a multi-processing 
JVM developed to explore the security architecture ram- 
ifications of protecting applications from each other, as 
opposed to just protecting the system from applications. 
They identify several areas of the JDK that assume a 
single-application model, and propose extensions to the 
JDK to allow multiple applications and to provide inter- 
application security. The focus of their multi-processing 
JVM is to explore the applicability of the JDK security 
model to multi-processing, and they rely on the existing, 
limited JDK infrastructure for resource control. 

Sun’s original JavaOS [37] was a standalone OS writ- 
ten almost entirely in Java. It is described as a first- 
class OS for Java applications, but appears to provide a 
single JVM with little separation between applications. 
It was to be replaced by a new implementation termed 
“JavaOS for Business” that also ran only Java applica- 
tions. “JavaOS for Consumers” is built on the Chorus mi- 
crokernel OS [33] to achieve real-time properties needed 
in embedded systems. Both of these systems apparently 
require a separate JVM for each Java application, and all 
run in supervisor mode. 

Joust [22], a JVM integrated into the Scout operating 
system [30], provides control over CPU time and net- 
work bandwidth. To do so, it uses Scout’s path abstrac- 
tion. However, Joust does not support memory limits on 
applications. 

The Open Group’s Conversant system [5] is another 
project that modifies a JVM to provide processes. It pro- 
vides each process with a separate address range (within 
a single Mach task), a separate heap, and a separate 
garbage collection thread. Conversant does not support 
sharing between processes, unlike KaffeOS, Alta, and 
the J-Kernel. 

The Real-Time for Java Experts Group [7] has pub- 
lished a proposal to add real-time extensions to Java. 
This proposal provides for scoped memory areas with a 
limited lifetime, which can be implemented using multi- 
ple heaps that resemble KaffeOS’s heaps. The proposal 
also dictates the use of write barriers to prevent pointer 
assignments to objects in short-lived inner scopes. Real- 
Time Java’s main focus is to ensure predictable garbage 
collection characteristics in order to meet real-time guar- 
antees; it does not address untrusted applications. 
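The write-barrier rule for scoped memory can be sketched in C as follows. This is a simplified model under explicit assumptions: scopes form a single nested stack, each object records the nesting depth of its scope (0 is the outermost, longest-lived scope), and `struct object` and `store_ref` are hypothetical names. A store is rejected when the target lives in a more deeply nested, and therefore shorter-lived, scope than the object that would hold the reference.

```c
#include <stddef.h>

/* Hypothetical object header: the depth of the scope that owns the object
 * and a single reference field. */
struct object {
    unsigned depth;          /* 0 = outermost scope; larger = shorter-lived */
    struct object *field;
};

/* Write barrier for `holder->field = target`.
 * Returns 0 and performs the store if it is legal; returns -1 and leaves
 * the field unchanged if the target could be reclaimed before the holder. */
int store_ref(struct object *holder, struct object *target)
{
    if (target != NULL && target->depth > holder->depth)
        return -1;           /* would create a dangling pointer later */
    holder->field = target;
    return 0;
}
```

An inner-scope object may therefore point outward, but an outer-scope object may not point inward.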


6 Conclusions 


We have described the design and implementation of 
KaffeOS, a Java virtual machine that supports the op- 
erating system abstraction of process. KaffeOS enables 
processes to be isolated from each other, to have their 
resources controlled, and still share objects directly. Pro- 
cesses enable the following important features: 


• The resource demands of Java processes can be 
accounted for separately, including memory con- 
sumption and GC time. 


• Java processes can be terminated if their resource 
demands are too high, without damaging the sys- 
tem. 


• Termination reclaims the resources of the termi- 
nated Java process. 


These features enable KaffeOS to run untrusted code 
safely, because it can prevent simple denial-of-service 
attacks that would disable standard JVMs. The cost of 
these features, relative to Kaffe, is reasonable. Because 
Kaffe’s performance is poor compared to commercial 
JVMs, it is difficult to estimate the cost of adding such 
features to a commercial JVM, but we believe that the 
overhead should not be excessive. Finally, even though 
KaffeOS is substantially slower than commercial JVMs, 
it exhibits much better performance scaling in the pres- 
ence of uncooperative code. 
Abstract 


Knit is a new component definition and linking language 
for systems code. Knit helps make C code more under- 
standable and reusable by third parties, helps eliminate 
much of the performance overhead of componentiza- 
tion, detects subtle errors in component composition that 
cannot be caught with normal component type systems, 
and provides a foundation for developing future analyses 
over C-based components, such as cross-component op- 
timization. The language is especially designed for use 
with component kits, where standard linking tools pro- 
vide inadequate support for component configuration. In 
particular, we developed Knit for use with the OSKit, 
a large collection of components for building low-level 
systems. However, Knit is not OSKit-specific, and we 
have implemented parts of the Click modular router in 
terms of Knit components to illustrate the expressive- 
ness and flexibility of our language. This paper provides 
an overview of the Knit language and its applications. 


1 Components for Systems Software 


Software components can reduce development time by 
providing programmers with prepackaged chunks of 
reusable code. The key to making software components 
work is to define components that are general enough 
to be useful in many contexts, but simple enough that 
programmers can understand and use the components’ 
interfaces. 

Historically, developers have seen great success with 
components only in the limited form of libraries. The 
implementor of a library provides services to unknown 
clients, but builds on top of an existing, known layer 
of services. For example, the X11 library builds on 
the C library. A component implementor, in contrast, 
provides services to an unknown client while simulta- 
neously importing services from an unknown supplier. 
Such components are more flexible than libraries, be- 
cause they are more highly parameterized, but they are 
also more difficult to implement and link together. De- 
spite the difficulty of implementing general components, 
an ever-growing pressure to reuse code drives the de- 
velopment of general component collections such as the 
OSKit [10]. 


Existing compilers and linkers for systems software 
provide poor support for components because these tools 
are designed for library-based software. For example, to 
reference external interfaces, a client source file must re- 
fer to a specific implementation’s header files, instead of 
declaring only the services it needs. Compiled objects 
refer to imports within a global space of names, implic- 
itly requiring all clients that need a definition for some 
name to receive the same implementation of that name. 
Also, to mitigate the performance penalty of abstraction, 
library header files often include specific function im- 
plementations to be inlined into client code. All of these 
factors tend to tie client code to specific library imple- 
mentations, rather than allowing the client to remain ab- 
stract with respect to the services it requires. 


A programmer can fight the system, and—by careful 
use of #include redirection, preprocessor magic, and 
name mangling in object files—manage to keep code ab- 
stracted from its suppliers. Standard programming tools 
offer the programmer little help, however, and the bur- 
den of ensuring that components are properly linked is 
again left to the programmer. This is unfortunate, con- 
sidering that component interfaces are inherently more 
complex than library interfaces. Indeed, attempting to 
use such techniques while developing the OSKit has 
been a persistent source of problems, both for ourselves 
as developers and for OSKit users. 


We have developed a new module language and 
toolset for managing systems components called Knit. 
Knit is based on units [8,9], a model of components in 
the spirit of the Mesa [23] and Modula-3 [14] module 
languages. In addition to bringing state-of-the-art mod- 
ule technology to C programs, Knit provides features of 
particular use in the design and implementation of com- 
plex, low-level systems: 
• Knit provides automatic scheduling of component 
initialization and finalization, even in the presence 
of mutual dependencies among components. This 
scheduling is possible because each component de- 
scribes, in addition to its import requirements, spe- 
cific initialization requirements. 


• Knit’s constraint-checking system allows program- 
mers to define domain-specific architectural invari- 
ants and to check that systems built with Knit sat- 
isfy those invariants. For example, we have used 
Knit to check that code executing without a process 
context will never call code that requires a process 
context. 


• Knit can inline functions across component bound- 
aries, thus reducing one of the basic performance 
overheads of componentization and encouraging 
smaller and more reusable components. 
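To make the process-context invariant from the list above concrete, here is a minimal C sketch of such a check over a static linking graph. It is not Knit's constraint language or algorithm. The two-valued `enum context`, the `struct link` edge type, and `first_violation` are assumptions introduced for illustration, with every component labeled either as code that may run without a process context or as code that requires one.

```c
#include <stddef.h>

/* Hypothetical labels for a component's code. */
enum context {
    NO_CONTEXT,        /* may execute without a process context */
    PROCESS_CONTEXT    /* requires a process context */
};

/* One edge of the static linking graph: `caller` calls into `callee`.
 * Both are indices into the array of component labels. */
struct link {
    size_t caller;
    size_t callee;
};

/* Returns the index of the first link on which code that may run without
 * a process context calls code that requires one, or nlinks if the
 * configuration satisfies the invariant. */
size_t first_violation(const enum context *ctx,
                       const struct link *links, size_t nlinks)
{
    size_t i;

    for (i = 0; i < nlinks; i++) {
        if (ctx[links[i].caller] == NO_CONTEXT &&
            ctx[links[i].callee] == PROCESS_CONTEXT)
            return i;
    }
    return nlinks;
}
```

Because every component carries a label, checking each link in isolation is enough: a chain of calls that eventually reaches process-context code from context-free code must cross a violating link somewhere.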


We have specifically designed Knit so that linking spec- 
ifications are static, and so that Knit tools can operate on 
components in source form as well as compiled form. 
Although dynamic linking and separate compilation fit 
naturally within our core component model, our imme- 
diate interests lie elsewhere. We are concerned with 
low-level systems software that is inherently static and 
amenable to global analysis after it is configured, but 
where flexibility and assurance are crucial during the 
configuration stage. 

Knit’s primary target application is the OSKit, a col- 
lection of components for building custom operating 
systems and extending existing systems. Knit is not 
OSKit-specific, however. As an additional example 
for Knit, we implemented part of MIT’s Click modu- 
lar router [25] in terms of Knit components, showing 
how Knit can help express both Click’s component im- 
plementations and its linking language. 

In the following sections we explain the problems 
with existing linking tools (Section 2) and present our 
improved language (Section 3), including its constraint 
system for detecting component mismatches (Section 4). 
We describe our initial experience with Knit in the OS- 
Kit and a subset of Click (Section 5). We then describe 
our preliminary work on reducing the performance over- 
head of componentization (Section 6). Finally, we de- 
scribe related work (Section 7). 


2 Linking Components 


Figure 1: Linking with ld 


With existing technology, the two main options for struc- 
turing component-based, low-level systems are to im- 
plement components as object files linked by ld (the 
standard Unix linker), or to implement components as 
objects in an object-oriented language (or, equivalently, 
COM objects). Neither is satisfactory from the point of 
view of component kits; the reasons given in Section 2.1 
and Section 2.2 reflect our experience in trying each 
within the OSKit. Though our experience with standard 
linking may be unsurprising, our analysis helps to illu- 
minate the parts of our Knit linking model, developed 
specifically for component programming, described in 
Section 2.3. 


2.1 Conventional Linking 


Figure 1(a) illustrates the way that a typical program is 
linked, through the eyes of an expert who understands 
the entire program. Each puzzle piece in the figure rep- 
resents an object (.o) file. A tab on a puzzle piece is 
a global variable or function provided by the object. A 
notch in a puzzle piece is a global variable or function 
used by the object that must be defined elsewhere in the 
program. Differently shaped tabs and notches indicate 
differently named variables and functions. 

The balloon on the left-hand side represents the col- 
lection of object files that are linked to create the pro- 
gram. The programmer provides this collection of files 
to the linker as a grab bag of objects. The right-hand 
side of the figure shows the linker’s output. The linker 
matches all of the tabs with notches, fitting the whole 
puzzle together in the obvious way. 

This bag-of-objects approach to linking is flexible. A 
programmer can reuse any puzzle piece in any other pro- 
gram, as long as the piece’s tabs and notches match other 
pieces’ notches and tabs. Similarly, any piece of the 
original puzzle can be replaced by a different piece as 
long as it has the same tabs and notches. Linkers sup- 
port various protocols for “overriding” a puzzle piece in 
certain bags (i.e., archive files) without having to modify 
the bag itself. In all cases, the tedious task of matching 
tabs and notches is automated by the linker. 

The bag-of-objects approach to linking has a num- 
ber of practical drawbacks, however. Figure 1(b) illus- 
trates the same linking process as Figure 1(a), but this 
time more realistically, through the eyes of a program- 
mer who is new to the program. Whereas the expert 
imagines the program to be composed of well-defined 
pieces that fit together in an obvious way, the actual 
code contains many irrelevant tabs and notches (e.g., a 
“global” variable that is actually local to that compo- 
nent, or a spurious and unused extern declaration) that 
obscure a piece’s role in the overall program. Indeed, 
some edges are neither clearly tabs nor clearly notches 
(e.g., an uninitialized global variable might implement 
an export or an import). If the new programmer wishes 
to replace a piece of the puzzle, it may not be clear which 
tabs and notches of the old piece must be imitated by the 
new piece, and which tabs and notches of the old piece 
were mere implementation artifacts. 

The bag-of-objects approach also has a significant 
technical limitation when creating new programs from 
existing pieces. Figure 1(c) illustrates the limitation: a 
new component is to be interposed between two existing 
components, perhaps to log all calls between the top- 
right component and the top-left component. To achieve 
this interposition, the tabs and notches of the bottom 
piece have the same shapes as the tabs and notches of 
the top pieces. Now, however, the bag of objects does 
not provide enough linking information to allow ld to 
resolve the ambiguous tabs and notches. (Should the 
linker build a two-piece or a three-piece puzzle?) The 
programmer will be forced to modify at least two of 
the components (perhaps with preprocessor tricks) to 
change the shape of some tabs and notches. 


2.2 Object-Based Linking 


At the opposite end of the spectrum from ld, com- 
ponents can be implemented as objects in an object- 
oriented language or framework. In this approach, the 
links among components are defined by arbitrary code 
that passes object references around. This is the view of 
components implemented by object-based frameworks 
such as COM [22] and CORBA [26], and by some lan- 
guages such as Limbo [6], which relies heavily on dy- 
namic (module-oriented) linking. 

Although linking via arbitrary run-time code is espe- 
cially flexible, it is too dynamic for most uses of com- 
ponents in systems software. Fundamentally, object- 
oriented constructs are ill-suited for organizing code at 
the module level [7, 30]. Although classes and objects 
elegantly express run-time concepts, such as files and 
network connections, they do not provide the structure 
needed by programmers (and analysis tools) to organize 
and understand the static architecture of a program. 

Symptoms of misusing objects as components include 
the late discovery of errors, difficulty in tracing the 
source of link errors, a performance overhead due to vir- 
tual function calls, and a high programmer overhead in 
terms of manipulating reference counts. Code for link- 
ing components is intermingled with regular program 
statements, making the code difficult for both humans 
and machines to analyze. Even typechecking is of lim- 
ited use, since object-based code uses many dynamic 
typechecks (i.e., downcasts) to verify that components 
have the expected types, and must be prepared to recover 
if this is not so. These problems all stem from using a 
dynamic mechanism (objects) to build systems in which 
the connections between components change rarely, if 
ever, after the system is configured and initialized. 

In short, object-based component languages offer lit- 
tle help to the programmer in ensuring that components 
are linked together properly. While objects can serve 
a useful and important role in implementing data struc- 
tures, they do as much harm as good at the component 
level. 


2.3 Unit Linking 


The linking model for units [8,9] eschews the bag of ob- 
jects in favor of explicit, programmer-directed linking. 
It also avoids the excessive dynamism and intractable 
analysis of object-based linking by keeping the link- 
ing specification separate from (and simpler than) the 
core programming language. The model builds on pi- 
oneering research for component-friendly modules in 
Mesa [23], functors in ML [21], and generic packages 
in Modula-3 [14] and Ada95 [18]. 

Linking with units includes specific linking instruc- 
tions that connect each notch to its matching tab. The 
linking specification may be hierarchical, in that a sub- 
set of the objects can be linked to form a larger object (or 
puzzle piece), which is then available for further linking. 

Unit linking can thus express the use pattern in 
Figure 1(c) that is impossible with ld. Furthermore, 
unlike object-based linking, a program’s explicit link- 
ing specification helps programmers understand the in- 
terface of each component and the role of each compo- 
nent in the overall program. The program’s linking hi- 
erarchy serves as a roadmap to guide a new programmer 
through the program structure. 

Unit linking also extends more naturally to cross- 
component optimization than do ld or object-based link- 
ing. The interfaces and linking graph for a program 
can be specified in advance, before any of the individ- 
ual components are compiled. The compiler can then 
combine the linking graph with the source code for sur- 
rounding components to specialize the compilation of an 
individual component. The linking hierarchy may also 
provide a natural partitioning of components into groups 
to be compiled with cross-component optimization, thus 
limiting the need to know the entire program to perform 
optimizations. 

The static nature of unit linking specifications makes 
them amenable to various forms of analysis, such as en- 
suring that components are linked in a way that satis- 
fies certain type and specification constraints. For ex- 
ample, details on the adaptation of expressive type lan- 
guages (such as that of ML) to units can be found in Flatt 
and Felleisen’s original units paper [9]. This support for 
static analysis provides a foundation for applying current 
and future research to systems components. 


3 Units for C 


In this section we describe our unit model in more de- 
tail, especially as it applies to C code. We first look at 
a simplified model that covers component imports, ex- 
ports, and linking. We then refine the model to address 
the complications of real code, including initialization 
constraints. 


3.1 Simplified Model 


Our linking model consists of two kinds of units: atomic 
units, which are like the smallest puzzle pieces, and com- 
pound units, which are like puzzle pieces that contain 
other puzzle pieces. Figure 2 expands the model of a 
unit given in Section 2.3 to a more concrete represen- 
tation for a unit implemented in C.¹ According to this 
representation, every atomic unit has three parts: 


1. A set of imports (the top part of the box), which 
are the names of functions and variables that will 
be supplied to the unit by another unit. 


2. A set of exports (the bottom part of the box), which 
are the names of functions and variables that are 
defined by the unit and provided for use by other 
units. 


¹Knit actually relies on a textual language for unit descriptions, as 
shown in Section 3.3. 


4th Symposium on Operating Systems Design and Implementation 


3. A set of top-level C declarations (the middle part of 
the box), which must include a definition for each 
exported name, and may include uses of each im- 
ported name. Defined names that are not exported 
will be hidden from all other units. 


The example unit in Figure 2 shows a component within 
a Web server, as it might be implemented with the 
OSKit. The component exports a serve_web func- 
tion that inspects a given URL and dispatches to either 
serve_file or serve_cgi, depending on whether the 
URL refers to a file or CGI script. 

Atomic units are linked together to form compound 
units, as illustrated in Figure 3. A compound unit has 
a set of imports (the top part of the outer box) that can 
be propagated to the imports of units linked to form the 
compound unit. The compound unit explicitly specifies 
how imports are propagated to other units; these propa- 
gations can be visualized as arrows. A compound unit 
also has a set of exports (the bottom part of the outer 
box) that are drawn from the exports of the units linked 
to form the compound unit. The compound unit explic- 
itly specifies which exports are to be propagated. Be- 
cause all connections are explicitly specified, arrows can 
connect imports and exports with different names, al- 
lowing each unit to use locally meaningful names with- 
out the danger of clashes in a global namespace. 

The imports of the linked units that are not satisfied by 
imports of the compound unit must be satisfied by link- 
ing them to the exports of other units within the com- 
pound unit. As before, the compound unit defines these 
links. The units linked together in a compound unit need 
not be atomic units; they can be compound units as well. 

The example in Figure 3 links the previous example 
unit with another unit that logs requested URLs. The 
original serve_web function is wrapped with a new 
one, serve_logged, to perform the logging. The re- 
sulting compound unit still requires serve_file and 
serve_cgi to be provided by other units, and also re- 
quires functions for manipulating files. The compound 
unit’s export is the logged version of serve_web. 
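The compound unit described above can be approximated in plain C by making every import an explicit field and performing all linking in one function. The sketch below is only an analogy for unit linking, not Knit syntax or Knit output. `struct port`, `port_call`, `link_logged_server`, the stub suppliers, and the "/cgi-bin/" test are all assumptions made for illustration, and a request counter stands in for writing to a log file.

```c
#include <stddef.h>
#include <string.h>

/* An import or export: a function plus the unit instance it belongs to. */
struct port {
    int (*fn)(void *unit, const char *url);
    void *unit;
};

int port_call(struct port p, const char *url) { return p.fn(p.unit, url); }

/* Atomic unit: imports serve_cgi and serve_file, exports serve_web. */
struct web_unit { struct port serve_cgi, serve_file; };

int serve_web(void *unit, const char *url)
{
    struct web_unit *u = unit;
    /* Assumption: CGI scripts live under "/cgi-bin/". */
    if (strncmp(url, "/cgi-bin/", 9) == 0)
        return port_call(u->serve_cgi, url);
    return port_call(u->serve_file, url);
}

/* Atomic unit: imports the unlogged server, exports a logged one. */
struct log_unit { struct port serve_unlogged; int requests; };

int serve_logged(void *unit, const char *url)
{
    struct log_unit *u = unit;
    u->requests++;                      /* stands in for a log write */
    return port_call(u->serve_unlogged, url);
}

/* Compound unit: contains both units and states every link explicitly. */
struct compound { struct web_unit web; struct log_unit log; };

struct port link_logged_server(struct compound *c,
                               struct port serve_cgi, struct port serve_file)
{
    c->web.serve_cgi = serve_cgi;       /* compound import -> web import */
    c->web.serve_file = serve_file;
    c->log.serve_unlogged = (struct port){ serve_web, &c->web };
    c->log.requests = 0;
    return (struct port){ serve_logged, &c->log };  /* compound export */
}

/* Hypothetical suppliers, used only to exercise the links. */
int stub_cgi(void *unit, const char *url)  { (void)unit; (void)url; return 1; }
int stub_file(void *unit, const char *url) { (void)unit; (void)url; return 2; }
```

Note that the logging unit's import is named serve_unlogged while the export it is connected to is named serve_web: because the link is stated explicitly, the two names do not have to match, and the interposition in Figure 1(c) needs no preprocessor tricks.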


3.2 Realistic Model 


To make units practical for real systems code, we must 
enhance the simple unit model in a number of ways. 
Figure 4 shows a more realistic model of units in Knit. 

First, instead of importing and exporting individual 
function names, Knit units import and export names in 
bundles. For example, the stdio bundle groups fopen, 
fprintf, and many other functions. Grouping names 
into bundles makes unit definitions more concise and 
lets programmers define components in terms of stan- 
dardized bundles. 

Second, the simplified model shows source code in- 
lined in the unit’s definition, but it is more practical to 
define units by referring to one or more external C files.² 
To convert the source code to compiled object files, Knit 
needs both the source files and their compilation flags. 
Figure 4 shows how the logging component’s content is 
created by compiling log.c using the include directory 
oskit/include. 


[Figure 2, general shape: imports import_1 ... import_m (top); C code that defines the exports and uses the imports (middle); exports export_1 ... export_n (bottom). Example: imports serve_cgi and serve_file; export serve_web; C code as follows.] 

    int serve_web(...) { 
        if (...) 
            serve_cgi(...); 
        else 
            serve_file(...); 
    } 

Figure 2: A unit implemented in C, ideally 


[Figure 3, general shape: imports import_1 ... import_m (top); linking graph over specific units (middle); exports export_1 ... export_n (bottom).] 

Third, realistic systems components have complex 
initialization dependencies. If there were no cyclic im- 
port relations among components, then initializations 
could be scheduled according to the import graph. In 
practice, however, cyclic imports are common, so the 
programmer must occasionally provide fine-grained de- 
pendency information to break cycles. A Knit unit there- 
fore provides an explicit declaration of the unit’s ini- 
tialization functions, plus information about the depen- 
dencies of exports and initializers on imports. Based 
on these declarations, Knit automatically schedules calls 
to component initializers. Finalizers are treated analo- 
gously to initializers, but are called after the correspond- 
ing exports are no longer needed. 

For example, the logging unit in Figure 4 defines an 
open_log function to initialize the component and a 
close_log function to finalize it. The functions ex- 
ported in the serveLog bundle are declared to call the 
functions in the imported serveWeb and stdio bun- 
dles, and the initialization and finalization functions 
open_log and close_log rely only on the functions 
in the stdio bundle. 


²Knit can actually work with C, assembly, and object code. Extend- 
ing Knit to handle C++, or any other language that compiles to .o with 
C-like conventions, would be straightforward but time-consuming. 


[Figure 3, example panel; surviving labels: serve_cgi, serve_web.] 

Figure 3: A compound unit, ideally 


The open_log and close_log dependency declara- 
tions reveal a subtlety in declaring initialization con- 
straints. The declaration “serveLog needs stdio” in- 
dicates that stdio must be initialized before any func- 
tion in the bundle serveLog is called. However, this 
declaration alone does not constrain the order of initial- 
ization between the logging component and the standard 
I/O component; it simply says that both must be initial- 
ized before a serveLog function is used. In contrast, 
the declaration “open_log needs stdio” ensures that
the standard I/O component is initialized before the logging
component, because the logging component’s initialization
relies on standard I/O functions. The distinction
between dependency levels is crucial to avoid
overconstraining the initialization order.
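
The difference between the two dependency levels can be made concrete with a small scheduling sketch. The C code below is illustrative only and is not Knit's implementation: the init_graph type and the schedule_initializers function are hypothetical names. It assumes that only initializer-level dependencies (such as “open_log needs stdio”) are recorded as ordering edges; export-level dependencies (such as “serveLog needs stdio”) are deliberately left out, because they do not constrain the relative order of the two initializers.

```c
#include <assert.h>

#define MAX_UNITS 16

/* Hypothetical representation: init_needs[i][j] != 0 means that unit i's
 * initializer calls functions of unit j, so j must be initialized first. */
typedef struct {
    int n;                                  /* number of units */
    int init_needs[MAX_UNITS][MAX_UNITS];   /* initializer-level edges only */
} init_graph;

/* Fills order[] with unit indices in a valid initialization order.
 * Returns the number of units scheduled, or -1 if the initializer-level
 * dependencies are cyclic and no order exists. */
int schedule_initializers(const init_graph *g, int order[])
{
    int done[MAX_UNITS] = {0};
    int count = 0;

    assert(g->n >= 0 && g->n <= MAX_UNITS);
    while (count < g->n) {
        int progressed = 0;
        for (int i = 0; i < g->n; i++) {
            int ready = 1;
            if (done[i])
                continue;
            /* A unit is ready once everything its initializer needs is done. */
            for (int j = 0; j < g->n; j++) {
                if (g->init_needs[i][j] && !done[j]) {
                    ready = 0;
                    break;
                }
            }
            if (ready) {
                done[i] = 1;
                order[count++] = i;
                progressed = 1;
            }
        }
        if (!progressed)
            return -1;  /* remaining units form a dependency cycle */
    }
    return count;
}
```

With unit 0 standing for standard I/O and unit 1 for the logging unit, setting init_needs[1][0] models “open_log needs stdio” and forces standard I/O to be scheduled first. Adding the reverse edge makes the function report a cycle, which corresponds to the case where a programmer would have to supply finer-grained dependency information.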


A final feature needed by real units is that imports
and exports may need to be renamed in order to associate
Knit symbols with the identifiers used in the
actual (C) implementation of a unit. For example, a
serial console implementation might define a function
serial_putchar, but export it as putchar to match a
more generic unit interface. Another example would be
a unit that both imports and exports a particular bundle
type, a pattern that occurs frequently in units designed
to “wrap” or interpose on other units. The logging unit
shown in Figure 4 is such a unit; the implementation
(the code in log.c) must be able to distinguish between
the imported serve_web and the exported serve_web
functions. This distinction is made by renaming the import
or the export (or both) so that the functions have
different names in the C code.

[Figure 4 diagram, general form]

  Imports: importBundle_1, ..., importBundle_m
    A bundle is a collection of names to import or export.
    Each bundle is itself named.

  Implementation (used by Knit to get the unit’s implementation):
    Information for obtaining a .o file: e.g., the name of a .c file,
    the flags for compiling it, and mappings between Knit symbols and
    C identifiers. The .o file defines all exports, initializers, and
    finalizers.

  Scheduling (used by Knit to schedule automatic calls to initializers
  and finalizers):
    initializer function for each export (optional)
    finalizer function for each export (optional)
    {initial,final}ization dependencies of exports on imports

  Exports: exportBundle_1, ..., exportBundle_n

[Figure 4 diagram, example]

  Imports: serveWeb: {serve_web}
           stdio: {fopen, fprintf, ...}

  files: log.c
  flags: -Ioskit/include
  rename serveWeb.serve_web to serve_unlogged
  rename serveLog.serve_web to serve_logged

  open_log initializes serveLog
  close_log finalizes serveLog
  serveLog needs serveWeb, stdio
  open_log needs stdio
  close_log needs stdio

  Exports: serveLog: {serve_web}

Figure 4: A unit implemented in C, more realistically. It exports a bundle serveLog containing the single function serve_web.


3.3 Example Code


Due to space constraints, we omit a full description of
the Knit syntax. Nevertheless, to give a sense of Knit’s
current concrete syntax, we show how to express the running
example.³ For the sake of exposition and maintaining
correspondence with the pictures, we have avoided
some syntactic sugar that can shorten real unit definitions.

³ The syntax continues to evolve as we gain experience. Also, although
we do not currently have a graphical tool for Knit, we are considering
implementing one in the future.

Figure 5 shows Knit declarations for the Web server 
and logging components, plus a compound unit linking 
them together. Before defining the units, the code de- 
fines bundle types Serve and Stdio (artificially brief in 
this example) and a set of compiler flags, CFlags. These 
declarations are used within the unit definitions. 

As in the graphical notation in Figure 4, the Web
unit in Figure 5 imports two functions, one for serving
files and another for serving CGI scripts. The notation
for both imports and exports declares a local name
within the unit (the left hand side of the colon) and
specifies the type of the bundle (the right hand side
of the colon). This local name can be used in subsequent
statements within the unit. Furthermore, all of
the exports (serveWeb) depend on all of the imports
(serveFile and serveCGI). The unit’s implementation
is in the file web.c, in Figure 6. The rename declarations
resolve the conflict between importing and exporting
three functions with the same name by mapping the
imported serve_web identifiers in bundles serveFile
and serveCGI onto the C identifiers serve_file and
serve_cgi.

    bundletype Serve = { serve_web }
    bundletype Stdio = { fopen, fprintf }
    flags CFlags = { "-Ioskit/include" }

    unit Web = {
      imports [ serveFile : Serve,
                serveCGI : Serve ];
      exports [ serveWeb : Serve ];
      depends {
        serveWeb needs (serveFile + serveCGI);
      };
      files { "web.c" } with flags CFlags;
      rename {
        serveFile.serve_web to serve_file;
        serveCGI.serve_web to serve_cgi;
      };
    }

    unit Log = {
      imports [ serveWeb : Serve,
                stdio : Stdio ];
      exports [ serveLog : Serve ];
      initializer open_log for serveLog;
      finalizer close_log for serveLog;
      depends {
        (open_log + close_log) needs stdio;
        serveLog needs (serveWeb + stdio);
      };
      files { "log.c" } with flags CFlags;
      rename {
        serveWeb.serve_web to serve_unlogged;
        serveLog.serve_web to serve_logged;
      };
    }

    unit LogServe = {
      imports [ serveFile : Serve,
                serveCGI : Serve,
                stdio : Stdio ];
      exports [ serveLog : Serve ];
      link {
        [serveWeb] <- Web <- [serveFile,serveCGI];
        [serveLog] <- Log <- [serveWeb,stdio];
      };
    }

Figure 5: Unit descriptions for parts of a Web server

The Log unit imports a serve_web function plus a
bundle of I/O functions, and exports a serve_web function.
The initialization and finalization declarations provide
the same information as the graphical version of the
unit in Figure 4. The dependency declarations specify
that the functions open_log and close_log call functions
in the stdio bundle, and all of the exports depend
on all of the imports. Finally, the rename declarations
resolve the conflict between the imported and exported
serve_web functions, this time by renaming both the
imported and exported versions.

    web.c:
    err_t serve_web(socket_t s, char *path) {
      if (!strncmp(path, "/cgi-bin/", 9))
        return serve_cgi(s, path + 9);
      else
        return serve_file(s, path);
    }

    log.c:
    static FILE *log;

    void open_log() {
      log = fopen("ServerLog", "a");
    }

    err_t serve_logged(socket_t s, char *path) {
      int r;
      r = serve_unlogged(s, path);
      fprintf(log, "%s -> %d\n", path, r);
      return r;
    }

Figure 6: Unit internals for parts of a Web server


The LogServe compound unit links the Web and Log 
units together, propagating imports and exports as in 
Figure 3. Specifically, in the link section, 


[serveWeb] <- Web <- [serveFile,serveCGI] 


instantiates a Web server unit using the serveFile and 
serveCGI imports, and binds the Web unit’s exported 
bundle to the local name serveWeb. The next line specifies
that serveWeb is an import, along with stdio, when
instantiating the Log unit. The exported bundle of the 
Log unit is bound to serveLog, which is also used in 
the exports declaration of the compound unit, indicat- 
ing that Log’s exports are propagated as exports from 
the compound unit. 

Figure 6 shows the C code implementing the Web 
and Log units. Only a few details have been omitted, 
such as the #include lines. Within web.c, the names 
serve_cgi and serve_file refer to imported func- 
tions, and serve_web is the exported function. Simi- 
larly, in log.c, serve_unlogged, fopen, and fprintf 
are all imports, while serve_logged is an export and 
open_log is the initializer. 
4 Checking Architectural Constraints


Beyond making components easier to describe and link, 
Knit is designed to enable powerful analysis and opti- 
mization tools for componentized systems code. In this 
sense, Knit serves as a bridge between low-level imple- 
mentation techniques and high-level program analysis 
techniques. 

Component kits, especially, need analysis tools to 
help ensure that components are assembled correctly — 
much more than libraries and fixed architectures need 
such tools. In the case of a fixed (but extensible) archi- 
tecture, a programmer can learn to code by a certain set 
of rules and to debug each extension until it seems to 
follow the rules. In the case of a component kit as flexi- 
ble as the OSKit, however, the rules of proper construc- 
tion change depending on which components are linked 
together. For example, in a kernel using a “null im- 
plementation” of threads, components need not provide 
re-entrant procedures because the “null implementation” 
keeps all execution single-threaded. But when the thread 
component is replaced with an actual implementation of 
threads, the rules for proper construction of the system 
suddenly require re-entrant procedures. 

These kinds of problems fall outside the scope of con- 
ventional checking tools such as static type systems. 
Type systems in most programming languages (includ- 
ing C, C++, etc.) express concrete properties of code, 
such as data representation and function calling conven- 
tions, and do not express abstract properties like dead- 
lock avoidance or whether code is in the top half or 
bottom half of a device driver. More importantly, con- 
ventional type systems detect local errors, but the prob- 
lems that occur in component software are often global 
errors, where each individual component composition 
may be correct, but the entire system is wrong. 

To start exploring the space of possible analyses over 
component-based programs, we have included in Knit 
a simple, extensible constraint system. This system al- 
lows a programmer to define properties that Knit should 
check, and then lets the programmer annotate each unit 
declaration with the properties it satisfies.⁴

As an example property, consider the distinction be- 
tween “top half” code, which includes functions like 
pthread_lock or sleep that require a process context, 
and “bottom half” code, which includes interrupt han- 
dlers that work without a context. We would like Knit 
to enforce the constraint that bottom-half code does not 
directly call top-half code, an error that might happen 
when a set of components is wired together incorrectly. 


⁴ Besides their value as checkable properties, constraints provide
useful documentation of the component’s behavior. Indeed, in our
experience so far, constraints often duplicate information provided
informally in documentation.

We can define this property in Knit with the following
declarations:


property context 
type NoContext 
type ProcessContext < NoContext 


which declare a property context and its two pos- 
sible values, with a partial ordering on the property 
values that indicates NoContext is more general than 
ProcessContext. 

Given this definition, a programmer can annotate im- 
ports and exports in units to establish property con- 
straints. The examples below illustrate the three most 
common forms of annotation: 


context(pthread_lock) <= ProcessContext
context(panic) >= NoContext
context(printf) <= context(putchar)


These three forms of constraint indicate that (1) a func- 
tion (in this case, pthread_lock) requires a process 
context; (2) a function (e.g., panic) must work in situa- 
tions where there is no process context; and (3) a func- 
tion (e.g., printf) cannot be more flexible than some 
other function (e.g., putchar, which is used to imple- 
ment printf). Note that the last form of constraint al- 
lows the constraints of one component to be propagated 
through other components in a chain of links. In prac- 
tice, we find that such propagation of constraints appears 
most often, since most components are flexible enough 
to adapt to many constraint environments. 

When components are linked together, Knit analyzes 
the components’ constraints and reports an error if the 
constraints cannot be satisfied for some property (or if an 
expected constraint declaration is missing). When Knit 
reports a property failure, it displays the shortest chain of 
constraints that demonstrates the source of the problem. 
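
The following C sketch shows one way such a check could work for the two-valued context property. It is an illustration under stated assumptions, not Knit's algorithm: the constraint_set type and the cs_* functions are hypothetical names, the property values are encoded as integers so that ProcessContext < NoContext, and each function is given a lower and upper bound that is tightened until nothing changes. Unlike Knit, the sketch only answers whether the constraints are satisfiable and does not report the shortest chain of constraints.

```c
#include <assert.h>
#include <stdbool.h>

/* Encoding of the partial order from the text: ProcessContext < NoContext. */
enum { PROCESS_CONTEXT = 0, NO_CONTEXT = 1 };

#define MAX_FUNCS 32
#define MAX_EDGES 64

/* Hypothetical constraint store: each function f has bounds lo[f]..hi[f];
 * edge k records the constraint context(from[k]) <= context(to[k]). */
typedef struct {
    int n;
    int lo[MAX_FUNCS], hi[MAX_FUNCS];
    int n_edges;
    int from[MAX_EDGES], to[MAX_EDGES];
} constraint_set;

void cs_init(constraint_set *cs, int n)
{
    assert(n >= 0 && n <= MAX_FUNCS);
    cs->n = n;
    cs->n_edges = 0;
    for (int i = 0; i < n; i++) {
        cs->lo[i] = PROCESS_CONTEXT;   /* no requirement yet */
        cs->hi[i] = NO_CONTEXT;
    }
}

/* context(f) <= v, e.g. pthread_lock requires a process context. */
void cs_at_most(constraint_set *cs, int f, int v)
{
    if (v < cs->hi[f]) cs->hi[f] = v;
}

/* context(f) >= v, e.g. panic must work without a process context. */
void cs_at_least(constraint_set *cs, int f, int v)
{
    if (v > cs->lo[f]) cs->lo[f] = v;
}

/* context(f) <= context(g), e.g. printf is implemented using putchar. */
void cs_le(constraint_set *cs, int f, int g)
{
    assert(cs->n_edges < MAX_EDGES);
    cs->from[cs->n_edges] = f;
    cs->to[cs->n_edges] = g;
    cs->n_edges++;
}

/* Propagates bounds along edges to a fixed point, then checks that every
 * function still has at least one admissible value. */
bool cs_satisfiable(constraint_set *cs)
{
    bool changed = true;
    while (changed) {
        changed = false;
        for (int k = 0; k < cs->n_edges; k++) {
            int f = cs->from[k], g = cs->to[k];
            if (cs->hi[g] < cs->hi[f]) { cs->hi[f] = cs->hi[g]; changed = true; }
            if (cs->lo[f] > cs->lo[g]) { cs->lo[g] = cs->lo[f]; changed = true; }
        }
    }
    for (int i = 0; i < cs->n; i++)
        if (cs->lo[i] > cs->hi[i])
            return false;
    return true;
}
```

For instance, constraining pthread_lock with cs_at_most(..., PROCESS_CONTEXT) and panic with cs_at_least(..., NO_CONTEXT) is satisfiable on its own. Adding cs_le(panic, pthread_lock), which models panic being wired to call pthread_lock, makes the set unsatisfiable; this is the bottom-half-calls-top-half error described above.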


5 Experience 


So far, we have applied Knit to two different sets of com- 
ponents: (1) the OSKit [10], a large set of components 
that includes many legacy components, and (2) a partial 
implementation of the Click modular router [25], com- 
prising a few new and cleanly constructed components. 

As reported in the following sections, our experience 
has been positive, but with two caveats. First, the cur- 
rent implementation of Knit is a prototype, and the only 
users to date are its implementors. Second, even the im- 
plementors are unsatisfied with the current Knit syntax, 
which leads to linking specifications that seem exces- 
sively verbose for many tasks. 


5.1 Knit and the OSKit 


The OSKit is a collection of components for building 
operating systems. Rather than defining a fixed structure 
for an operating system, the OSKit provides raw materials
for implementing whatever system structure a user
has in mind. OSKit components can be combined in 
endless ways, and users are expected to write their own 
extensions and replacements for many kinds of compo- 
nents, depending on the needs of their designs. The 
components range in size from large, such as a TCP/IP 
stack derived from FreeBSD (over 18000 lines of non- 
blank/non-comment/etc. code), to small, such as serial 
console support (less than 200 lines of code). To use 
these components, OSKit users—many of whom know 
little about operating systems—must understand the in- 
terface of each component, including its functional de- 
pendencies and its initialization dependencies. 

Before Knit: In the initial version of the OSKit, each
component was implemented by one or more object (.o)
files, which were stored in library archives, linked via
ld. A component could be replaced by providing a replacement
object file/library before the original library
in the ld linking line. Since ld inspects its arguments in
order, and since it ignores archive members that do not
contribute new symbols (referenced by previously used
objects), a careful ordering of ld’s arguments would allow
a programmer to override an existing component.

As the OSKit grew in size and user base, experience
soon revealed the deficiencies of ld as a component-linking
tool. As depicted in Figure 1(c), interposition
on component interfaces was difficult. Similarly, components
that provided different implementations of the
same interface would clash in the global namespace used
for linking by ld. Even just checking that the linked set
of components matched the intended set was difficult.

To address these issues (and, orthogonally, to repre- 
sent run-time objects such as open files), a second ver- 
sion of the OSKit introduced COM abstractions for many 
kinds of components. For example, the system console, 
thread blocking, memory allocation, and interrupt han- 
dling are all implemented by COM components in the 
OSKit. For convenience, these COM objects are typi- 
cally stored in a central “registry component.” 

Although adding COM interfaces to the OSKit solved
many of the technical issues with ld linking, in some
ways it worsened the usability problems. Programmers 
who had successfully used the simple function interfaces 
in the original OSKit at first rebelled at having to set up 
seemingly gratuitous objects and indirections. Program- 
mers became responsible for getting reference counts 
right and for linking objects together by explicitly pass- 
ing pointers among COM instances. In practice, merely 
getting the reference counting right was a significant bar- 
rier to experimenting with new system configurations. 
Furthermore, inconvenient COM interfaces proved contagious.
For example, to support a kernel in which different
parts of the system use different memory pools,
the memory allocator component had to be made a COM
object. This required changes to all code that uses allocators,
changes to the code that inserts objects into the
registry, and careful tweaking of the initialization order
to try to ensure that objects in the registry were allocated
with and subsequently used the correct allocators.

With both ld and COM, component linking problems
interfere with the main purpose of the OSKit, which is to 
be a vehicle for quick experimentation. The motivation 
for Knit is to eliminate these problems, allowing pro- 
grammers to specify which components to link together 
as directly as possible. 

After Knit: We have converted approximately 250 
components—about half of the OSKit—and about 20 
example kernels to Knit. The process of developing 
Knit declarations for OSKit components revealed many 
properties and interactions among the components that 
a programmer would not have been able to learn from 
the documentation alone. Annotating a component took 
anywhere from 15 minutes (typically) to a full day 
(rarely), depending partly on the complexity of the com- 
ponent and its initialization requirements but mostly on 
the quality of the documentation (e.g., whether the im- 
ports and exports were clear). 

Using Knit, we can now easily build systems that we 
could not build before without undue effort. For ex- 
ample, OSKit device drivers generate output by call- 
ing printf, which is also used for application output. 
Redirecting device driver output without Knit requires 
creating two separate copies of printf, then renam- 
ing printf calls in the device drivers either through 
cut-and-paste (a maintenance problem) or preprocessor 
magic (a delicate operation). Interposing on functions 
requires similar tricks. Such low-tech solutions work 
well enough for infrequent operations on a small set 
of names, but they do not scale to component environ- 
ments in which configuration changes are frequent. Us- 
ing Knit, interposition and configuration changes can be 
implemented and tested in just a few minutes. 
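
For comparison, the pre-Knit “preprocessor magic” mentioned above might look like the following C sketch. All names here (driver_printf, driver_log, driver_report) are hypothetical and are not taken from the OSKit; the sketch only illustrates why the approach is delicate. Every driver source file has to be compiled with the redirecting macro in effect, and redefining a standard library name as a macro is something the C standard does not guarantee to work.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical second copy of printf for device drivers: it formats into a
 * buffer instead of writing to the application's standard output. */
static char driver_log[256];

static int driver_printf(const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(fmt != NULL);
    va_start(ap, fmt);
    n = vsnprintf(driver_log, sizeof driver_log, fmt, ap);
    va_end(ap);
    return n;
}

/* The redirect: inside driver code, every call spelled "printf" is
 * silently rewritten to call driver_printf instead. */
#define printf driver_printf

static void driver_report(int irq)
{
    printf("irq %d\n", irq);   /* actually calls driver_printf */
}

#undef printf   /* code after this point sees the ordinary printf again */
```

With Knit, the same effect is obtained by linking the driver unit's printf import to a different exporting unit, so neither the driver source nor its compilation flags need to change.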

Knit’s automatic scheduling of initialization code was 
a significant aid in exploring kernel configurations. In 
a monolithic or fixed-framework kernel, an expert pro- 
grammer can write a carefully devised function that calls 
all initializers in the right order, once and for all. This 
is not an option in the OSKit, where the correct order 
depends on which components are glued together. Pre- 
vious versions of the OSKit provided canned initializa- 
tion sequences, but, as just described, using these se- 
quences would limit the programmer’s control over the 
components used in the configuration. Knit allows the 
expert to annotate components with their dependencies 
and allows client programmers to combine precisely the 
components they want with reliable initialization. Annotations
for device drivers, filesystems, networking, console,
and other intertwined components have proven relatively
easy to get right at the local level, and the scheduler
has performed remarkably well in practice.

The constraint system described in Section 4 caught 
a few small errors in existing OSKit kernels, writ- 
ten by ourselves, OSKit experts. We added con- 
straints to kernels composed of roughly 100 units. 
Among those units, 35 required the addition of 
constraints, of which 70% simply propagated their 
context from imports to exports using the constraint
“context(exports) <= context(imports)”
or stated that a component could be used without a pro- 
cess context. These required little effort. The remainder 
(device drivers and thread packages) required more care 
because we had to examine the source code to determine 
how individual components were used. The errors we 
found were easy to fix once identified. The advantage 
of Knit is that its constraint system found the bugs, and 
will continue to detect new bugs as the code evolves. 

A further benefit of using Knit is that it makes it easier 
to create small, special-purpose kernels. The combina- 
tion of knowing exactly which components are in our 
kernels (and why) and the ease of replacing one compo- 
nent with another enabled us to dramatically reduce the 
size of some kernels. An extreme example is our small- 
est kernel (the toy hello_world kernel) which is four 
times smaller when built with Knit than without. 

The Knit version of the OSKit continues to use COM 
for subsystems that behave more like objects than mod- 
ules. For example, individual files and directories are 
still implemented as COM objects. 


5.2 Clack 


The elegant Click modular router [25] allows a program- 
mer, or even a network administrator, to build a special- 
purpose router by wiring together a set of components. 
Click provides its own language for configuring routers, 
so that a programmer might write 


FromDevice(eth) -> Counter -> Discard 


to create a “router” that counts packets. 

Click is implemented in C++, and each router compo- 
nent is implemented by a C++ class instance. A pro- 
grammer can add new kinds of router components to 
Click by deriving new C++ classes. To demonstrate that 
Knit is general and more than just a tool for the OSKit, 
we implemented a subset of Click version 1.0.1 with 
Knit components instead of C++ classes.⁵ We dubbed
our new component suite Clack. 

⁵ Because Click’s router components are generally very small and
functionally simple, much of the actual component source code deals
with the Click-specific component framework and not with the functional
purpose of the components. For this reason, we decided to write
our components from scratch rather than adapt the existing Click
components to Knit.

Given Click as a model, implementing enough of
Clack in Knit to build an IP router (without handling
fragmentation or IP options) took a few days. A typical
Clack component required several lines of C plus
several lines of unit description. Clack follows the ba- 
sic architecture of Click, but the details have been Knit- 
ified. For example, Click supports component initial- 
ization through user-provided strings. Clack emulates 
this feature with trivial components that provide initial- 
ization data. Similarly, Click’s support for a (configure- 
time) variable number of imports or exports is handled in 
Clack with appropriate fan-in and fan-out components. 
Clack does not emulate the more dynamic aspects of 
Click, such as allowing a component to locate certain 
other components at run time. 

Overall, by avoiding the syntactic overhead required 
to retrofit C++ classes as components, Clack defini- 
tions are considerably more compact than corresponding 
Click definitions (by roughly a factor of three for small 
components). The size of Clack .o files was smaller than 
Click .o’s by an even more dramatic amount (roughly a 
factor of seven for small components). This is mainly 
due to Clack’s fine-grained control of the router’s con- 
tent and Click’s support for dynamic composition. The 
overall performance of Clack is comparable to that of 
Click. 

In contrast, using the full Knit linking language to join 
Clack components is more complex than using Click’s 
special-purpose language. If Clack were to be used 
by network administrators, we would certainly build a 
(straightforward) translator from Click linking specifi- 
cations to Knit linking expressions. 

Based on our small experiment, we believe that Knit 
would have been a useful tool for implementing the orig- 
inal Click component set. The Click architecture fits 
well in the Knit language model, and the Click configu- 
ration language is conceptually close to the Knit linking 
model. The one aspect of Click that does not fit well 
into Knit is the rapid deployment of new configurations. 
Click configurations consist of C++ object graphs that 
can be dynamically generated, whereas Clack configu- 
rations are resolved at link time. Note, however, that 
recent work on Click performance by its authors also 
conflicts with dynamic configuration [19]. 

To the extent that Knit is a bridge to analyses and 
optimizations, we believe that Knit would be a supe- 
rior implementation environment for Click compared 
to C++. In Section 6, we report on cross-component 
optimizations in Knit, and we show that they substan- 
tially increase the performance of Clack. The constraint- 
checking facilities of Knit can also be used to enforce 
configuration restrictions among Clack components, en- 
suring, for example, that components only receive pack- 
ets of an appropriate type (Ethernet, IP, TCP, ARP, etc.). 
These analyses are only a start, and a Knit-based Click
would be able to exploit future Knit developments. 


6 Implementation and Performance 


Component software tends to have worse performance 
than monolithic software. Introducing component 
boundaries invariably increases the number of function 
calls in a program and hides opportunities for optimiza- 
tion. However, Knit’s static linking language allows it 
to eliminate these costs. Indeed, we can achieve useful 
levels of optimization while exploiting the existing in- 
frastructure of compilers and linkers. 

In a typical use, the Knit compiler reads the link- 
ing specification and unit files, generates initialization 
and finalization code, runs the C compiler or assembler 
when necessary, and ultimately produces object files. 
The object files are then processed by a slightly modified
version of GNU’s objcopy, which handles renaming
symbols and duplicating object code for multiply-instantiated
units. Finally, these object files are linked
together using ld to produce the program.

To verify that Knit does not impose an unacceptable 
overhead on programs, we timed Knit-based OSKit pro- 
grams that were designed to spend most of their time 
traversing unit boundaries. We compared these pro- 
grams with equivalent OSKit programs built using tra- 
ditional tools. The number of units in the critical path 
ranged between 3 and 8 (including units such as memory 
file systems, VGA device drivers, and memory alloca- 
tors), with the total number of units between 37 and 72. 
Tests were run between 10 and a million times, as appropriate.
Knit was from 2% slower to 3% faster, ±0.25%.
Note that these experiments were done without applying 
the optimization that we describe next. 

For cross-component optimization, we have imple- 
mented a strategy that is deceptively simple to describe: 
Knit merges the code from many different C files into a 
single file, and then invokes the C compiler on the re- 
sulting file. The task of merging C code is simple but 
tedious; Knit must rename variables to eliminate con- 
flicts, eliminate duplicate declarations for variables and 
types, and sort function definitions so that the definition 
of each function comes before as many uses as possible 
(to encourage inlining in the C compiler). Fortunately, 
these complexities are minor compared to building an 
optimizing compiler. To limit the size of the file pro- 
vided to the compiler, Knit can merge files at any unit 
boundary, as directed by the programmer via the unit 
specifications. When used in conjunction with the GNU
C compiler (which has poor interprocedural optimization),
this enables functions to be inlined across component
boundaries, which may, in turn, enable further
intraprocedural optimizations such as constant folding and
common subexpression elimination.⁶
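
As a rough illustration of what a merged file could look like, the C sketch below hand-flattens the Web and Log units of Figure 6 into one translation unit. It is not output from Knit: the Web__serve_web name, the stub bodies of serve_file and serve_cgi, and the log_count counter (standing in for the fprintf call) are assumptions made so that the example is self-contained. The sketch shows the three steps named above: conflicting names are made unique, shared declarations appear once, and each function is defined before its uses so that the C compiler can inline it.

```c
#include <assert.h>
#include <string.h>

/* Shared declarations, emitted once (hypothetical definitions). */
typedef int err_t;
typedef int socket_t;

/* Stubs standing in for the units that export serve_file and serve_cgi. */
static err_t serve_file(socket_t s, const char *path)
{
    (void)s; (void)path;
    return 1;
}

static err_t serve_cgi(socket_t s, const char *path)
{
    (void)s; (void)path;
    return 2;
}

/* From web.c: the exported serve_web, renamed to avoid clashing with the
 * Log unit's export and made static so the compiler may inline it. */
static err_t Web__serve_web(socket_t s, const char *path)
{
    if (!strncmp(path, "/cgi-bin/", 9))
        return serve_cgi(s, path + 9);
    else
        return serve_file(s, path);
}

/* From log.c: placed after Web__serve_web so the callee's definition is
 * visible at the call site. A counter replaces the log file here. */
static int log_count;

err_t serve_logged(socket_t s, const char *path)
{
    err_t r;

    assert(path != NULL);
    r = Web__serve_web(s, path);
    log_count++;
    return r;
}
```

Because every callee is a static function defined earlier in the same file, a compiler that inlines only within a translation unit can collapse the whole call chain into serve_logged.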

To test the effectiveness of Knit’s optimization tech- 
nique (which we call flattening), we applied it to our 
Clack IP router. Since our focus was on the structure 
of the router, we flattened only the router rather than the 
entire kernel. For comparison, we rewrote our router 
components in a less modular way: combining 24 sepa- 
rate components into just 2 components, converting the 
result to idiomatic C, and eliminating redundant data 
fetches. The most important measure of an optimiza- 
tion is, of course, the time the optimized program takes. 
In this case, we also measured the impact of stalls in 
the instruction fetch unit because there is a risk that 
the inlining enabled by flattening would increase the 
size of the router code, leading to poor I-cache performance.⁷
Our experiments were performed on three
200 MHz Pentium Pro machines, each with 64MB of 
RAM and 256 KB of L2 cache, directly connected via 
DEC Tulip 10/100 Ethernet cards, with the “machine in 
the middle” functioning as the IP router. 

The results are shown in Table 1. The manual trans- 
formation gives a significant (21%) performance im- 
provement, demonstrating that componentization can 
have significant overhead. Flattening the modular ver- 
sion of the router gives an even more significant (35%) 
improvement: rather than harming I-cache behavior, 
flattening greatly improves I-cache behavior. Examina- 
tion of the assembly code reveals that flattening elim- 
inates function call overhead (e.g., the cost of pushing 
arguments onto the stack), turns function call nests into 
compact straight-line code, and eliminates redundant 
reads via common subexpression elimination. Combin- 
ing both optimizations gives only a small (5%) addi- 
tional improvement in performance, suggesting that the 
optimizations obtain their gains from the same source. 
Our overall conclusion is that we can eliminate most of 
the cost of componentization by blindly merging code, 
enabling conventional optimizing compilers to do the 
rest. 

Meanwhile, the authors of Click have been working
on special-purpose optimizations for their system [19].
Their optimizations include a “fast classifier” that generates
specialized versions of generic components, a “specializer”
that makes indirect function calls direct, and
an “xform” step that recognizes certain patterns of components
and replaces them with faster ones. While their
code base and optimizations are very different from ours,
the relative performance of their system and the effectiveness
of their optimizations provide a convenient
touchstone for our results. The performance of their base
and optimized systems is shown in Table 2. We note that
the performance of their base system is approximately
the same as ours (3% slower) but that the effect of applying
all three Click optimizations is significantly better
than the two Clack optimizations (54%). Considering
that Knit achieves its performance increase by blindly
merging code, without any profiling or tuning of Clack
by programmers, we again interpret the results of our
experiment to indicate that Knit would make a good
implementation platform for Click-like systems. We believe
that Knit would save implementors of such systems time
and energy implementing basic optimizations, allowing
them to concentrate on implementing domain-specific or
application-specific optimizations.

⁶ We used gcc version 2.95.2 for all our experiments.

⁷ We also measured the number of instruction misses in the L1 and L2
caches: both the overall downward trend and the approximate ratio
between the three numbers were the same across all experiments.

[Table 1 column headings: hand optimized; flattened; instr. fetch
stall cycles; text size (bytes). Surviving entries: instr. fetch stall
cycles 781; text size (bytes) 109464, 108246, 106065, 106305.]

Table 1: Clack router performance using various optimizations,
measured in number of cycles from the moment a packet
enters the router graph to the moment it leaves. I-fetch stalls
were measured using the Pentium Pro counters and are reported
in cycles. The i-fetch stall numbers along with the given
code sizes reveal that inlining did not have a negative effect.

[Table 2 column headings: version; cycles. Rows: unoptimized;
optimized.]

Table 2: Click router performance, with and without all three
MIT optimizations, measured as above. The Click routers were
executed in the same OSKit-derived kernel and on the same
hardware as the Clack routers.

The core of our current Knit compiler prototype con- 
sists of about 6000 lines of Haskell code, of which 
roughly 500 lines implement initializers and finaliz- 
ers, 500 lines implement constraints, and 1500 lines 
implement flattening. Our prototype implementation 
is acceptably fast—more than 95% of build time is 
spent in the C compiler and linker—although constraint- 
checking more than doubles the time taken to run Knit. 


7 Related Work 


Much of the early research in component-based systems 
software involves the design and implementation of mi- 
crokernels such as Mach [1] and Spring [13]. With an 
emphasis on robustness and architectures for flexibil- 
ity and extensibility at subsystem boundaries, microker-
nel research is essentially complementary to research on
component implementation and composition tools like 
Knit. More recently, the Pebble [11] microkernel-based 
OS has an emphasis on flexibly combining components 
in protection domains, e.g., in separate servers or in a 
single protection domain. However, the actual set of 
components is quite fixed. 

More closely related research involves the use of com- 
ponent kits for building systems software. MMLite [16] 
followed our OSKit lead by providing a variety of COM- 
based components for building low-level systems. MM- 
Lite takes a very aggressive approach to componentiza- 
tion and provides certain features that the current OSKit 
and Knit lack, such as the ability to replace system com- 
ponents at run-time. The Scout operating system [24] 
and its antecedent x-kernel [17] consist of a modest num- 
ber of modules that can be combined to create software 
for “network appliances” and protocols. The Click sys- 
tem [25] also focuses on networking, but specifically tar- 
gets packet routers (e.g., IP routers) and its components 
are much smaller than Scout’s. Scout and Click, like 
Knit, rely on “little languages” outside of C to build 
and optimize component compositions, and to sched- 
ule component initializations, but only Knit provides a 
general-purpose language that can be used to describe 
both new and existing components. Languages like C++ 
and Java have also dealt with automatic initialization of 
static variables, but through complex (and, in the case of 
C++, unpredictable) rules that give the programmer little 
control. 

Knit’s ability to work with unmodified C code distin- 
guishes it from projects such as Fox [15] and Ensem- 
ble [20], which rely on a high-level implementation lan- 
guage, or systems such as pSOS [31] and the currently 
very small eCos [29]. Both eCos and pSOS provide 
configuration languages/interfaces but neither really has 
“components”: individual subsystems can be included 
or excluded from the system, but there is no way to 
change the interconnections between components. In 
contrast to Ensemble, it should be noted that Knit pro- 
vides a more “lightweight” system for reasoning about 
component compositions. This is a deliberate choice: 
we intend for Knit to be usable by systems programmers 
without training in formal methods. 

Like Knit, OMOS [27,28] enables the reuse of existing 
code through interface conversion on object files (e.g., 
renaming a symbol, or wrapping a set of functions). Un- 
like Knit, however, OMOS does not provide a configura- 
tion language that is conducive to static analysis. 

Commercial tools, such as Visual Basic and tools us- 
ing it, help programmers design, browse, and link soft- 
ware components that conform to the COM or CORBA 
standards. Such tools and frameworks currently lack 
the kinds of specification information that would en-
able automated checking of component configurations
beyond datatype interfaces, and the underlying object- 
based model of components makes cross-component op- 
timization exceedingly difficult. 

Knit’s unit language is derived from the compo- 
nent model of Flatt and Felleisen [9], who provide an 
extensive overview of related programming languages 
research. Module work that subsequently achieved 
similar goals includes the recursive functors of Crary 
et al. [5] and the typed-assembly linker of Glew and 
Morrisett [12]. 

CM [3] solves for ML many of the same problems 
that Knit solves for C, but ML provides CM with a pre-
existing module language and a core language with well-
defined semantics. Unlike Knit, CM disallows recursive
modules, thus sidestepping initialization issues. Further, 
CM relies on ML’s type system to perform consistency 
checks, instead of providing its own constraint system. 

The lazy functional programming community rou- 
tinely uses higher-order functions to glue small com- 
ponents into larger components, relies on sophisticated 
type systems to detect errors in component composition, 
and makes use of lazy evaluation to dynamically de- 
termine initialization order in cyclic component graphs. 
For example, the Fudgets GUI component library [4] 
uses a dataflow model similar to the one used in Click 
(and Clack) but the data sent between elements can have 
a variety of types (e.g., menu selections, button clicks, 
etc.) instead of a single type (e.g., packets). This ap- 
proach has been refined and applied to different domains 
by a variety of authors. 

The GenVoca model of software components [2] has 
several similarities to our work: their “realms” corre- 
spond to our bundle types, their “type equations” cor- 
respond to our linking graphs, and their “design rules” 
correspond to our constraint systems. In its details, how- 
ever, the GenVoca approach is quite different from ours. 
GenVoca is based on the notion of a generator, which 
is a compiler for a domain-specific language. GenVoca 
components are program transformations rather than 
containers of code; a GenVoca compiler therefore syn- 
thesizes code from a high-level (and domain-specific) 
program description. In contrast, Knit promotes the 
reuse of existing (C) code and enables flexible compo- 
sition through its separate, unit-based linking language. 
Notably, Knit allows cyclic component connections— 
important in many systems—and can check constraints 
in such graphs. The GenVoca model of components and 
design rules, however, is based on (non-cyclic) compo- 
nent trees. 


8 Conclusion 


From our experiences in building and using OSKit com-
ponents, and those of our clients in using them, we believe
that existing tools do not adequately address the needs of
componentized software. To fill the gap, we have devel- 
oped Knit, a language for defining and linking systems 
components. Our initial experiments in applying Knit to 
the OSKit show that Knit provides improved support for 
component programming. 

The Knit language continues to evolve, and future 
work will focus on making components and linking 
specifications easier to define. In particular, we plan to 
generalize the constraint-checking mechanism to reduce 
repetition between different constraints and, we hope, 
to unify scheduling of initializers with constraint check- 
ing. We may also explore support for dynamic linking, 
where the main challenge involves the handling of con-
straint specifications at dynamic boundaries. Continued
exploration of Knit within the OSKit will likely produce 
improvements to the language and increase our under- 
standing of how systems components should be struc- 
tured. 

Knit is a first step in a larger research program to bring
strong analysis and optimization techniques to bear on 
componentized systems software. We expect such tools 
to help detect deadlocks, detect unsafe locking, reduce 
abstraction overheads, flatten layered implementations, 
and more. All of these tasks require the well-defined 
component boundaries and static linking information 
provided by Knit. We believe that other researchers and 
programmers who are working on componentized sys- 
tems could similarly benefit by using Knit. 


Availability 
Source and documentation for our Knit prototype are
available under http://www.cs.utah.edu/flux/.